Matching and extracting data using regular expressions

Time: 2022-09-09 00:29:17

Problem: find a matching string and extract data from the matched string. There are a number of command strings, each made up of keywords and data.

Command Examples:

  1. Ask name to call me
  2. Notify name that do this action
  3. Message name that request

Keywords: Ask, Notify, Message, to, that. Data:

Input strings:

  1. Ask peter to call me
  2. Notify Jenna that I am going to be away
  3. Message home that I am running late

My problem has two parts: 1) find the matching command, 2) extract the data.

Here is what I am doing. I create multiple regular expressions:

  "Ask[[:s:]][[:w:]]+[[:s:]]to[[:s:]][[:w:]]+" or "Ask([^\t\n]+?)to([^\t\n]+?)"
  "Notify[[:s:]][[:w:]]+[[:s:]]that[[:s:]][[:w:]]+" or "Notify([^\t\n]+?)that([^\t\n]+?)"

void searchExpression(const char *regString)
{
    std::string str;
    boost::regex callRegEx(regString, boost::regex_constants::icase);
    boost::cmatch im;

    while(true) {
       std::cout << "Enter String: ";
       getline(std::cin, str);
       fprintf(stderr, "str %s regstring %s\n", str.c_str(), regString);

       if(boost::regex_search(str.c_str(), im, callRegEx)) {
             int num_var = im.size() + 1;
             fprintf(stderr, "Matched num_var %d\n", num_var);
             for(int j = 0; j <= num_var; j++) {
                    fprintf(stderr, "%d) Found %s\n",j, std::string(im[j]).c_str());
             }
      }
      else {
          fprintf(stderr, "Not Matched\n");
      }
   }
}

I am able to find a matching string, but I am not able to extract the data. Here is the output:

input_string: Ask peter to call Regex Ask[[:s:]][[:w:]]+[[:s:]]to[[:s:]][[:w:]]+
Matched num_var 2
0) Found Ask peter to call
1) Found
2) Found

I would like to extract "peter" and "call" from "Ask peter to call".

2 Solutions

#1


4  

Since you really want to parse a grammar, you should consider Boost's parser generator.

You'd simply write the whole thing top-down:

auto sentence  = [](auto&& v, auto&& p) { 
    auto verb     = lexeme [ no_case [  as_parser(v) ] ];
    auto name     = lexeme [ +graph ];
    auto particle = lexeme [ no_case [  as_parser(p) ] ];
    return confix(verb, particle) [ name ]; 
};

auto ask     = sentence("ask",     "to")   >> lexeme[+char_];
auto notify  = sentence("notify",  "that") >> lexeme[+char_];
auto message = sentence("message", "that") >> lexeme[+char_];

auto command = ask | notify | message;

This is a Spirit X3 grammar for it. Read lexeme as "keep the whole word together" (don't skip spaces inside it).

Here, "name" is taken to be anything up to the expected particle.¹

If you just want to return the raw string matched, this is enough:

#include <iostream>
#include <string>
#include <boost/spirit/home/x3.hpp>
#include <boost/spirit/home/x3/directive/confix.hpp>

namespace x3 = boost::spirit::x3;

namespace commands {
    namespace grammar {
        using namespace x3;

        auto sentence  = [](auto&& v, auto&& p) { 
            auto verb     = lexeme [ no_case [  as_parser(v) ] ];
            auto name     = lexeme [ +graph ];
            auto particle = lexeme [ no_case [  as_parser(p) ] ];
            return confix(verb, particle) [ name ]; 
        };

        auto ask     = sentence("ask",     "to")   >> lexeme[+char_];
        auto notify  = sentence("notify",  "that") >> lexeme[+char_];
        auto message = sentence("message", "that") >> lexeme[+char_];

        auto command = ask | notify | message;

        auto parser  = raw [ skip(space) [ command ] ];
    }
}

int main() {
    for (std::string const input : {
            "Ask peter to call me",
            "Notify Jenna that I am going to be away",
            "Message home that I am running late",
            })
    {
        std::string matched;

        if (parse(input.begin(), input.end(), commands::grammar::parser, matched))
            std::cout << "Matched: '" << matched << "'\n";
        else
            std::cout << "No match in '" << input << "'\n";
    }

}

Prints:

Matched: 'Ask peter to call me'
Matched: 'Notify Jenna that I am going to be away'
Matched: 'Message home that I am running late'

BONUS

Of course, you'd actually want to extract the relevant bits of information.

Here's how I'd do that. Let's parse into a struct:

struct Command {
    enum class Type { ask, message, notify } type;
    std::string name;
    std::string message;
};

And let's write our main() as:

commands::Command cmd;

if (parse(input.begin(), input.end(), commands::grammar::parser, cmd))
    std::cout << "Matched: " << cmd.type << "|" << cmd.name << "|" << cmd.message << "\n";
else
    std::cout << "No match in '" << input << "'\n";

#include <iostream>
#include <string>
#include <boost/fusion/adapted/struct.hpp>
#include <boost/spirit/home/x3.hpp>
#include <boost/spirit/home/x3/directive/confix.hpp>

namespace x3 = boost::spirit::x3;

namespace commands {

    struct Command {
        enum class Type { ask, message, notify } type;
        std::string name;
        std::string message;

        friend std::ostream& operator<<(std::ostream& os, Type t) { return os << static_cast<int>(t); } // TODO
    };

}

BOOST_FUSION_ADAPT_STRUCT(commands::Command, type, name, message)

namespace commands {

    namespace grammar {
        using namespace x3;

        auto sentence  = [](auto type, auto&& v, auto&& p) { 
            auto verb     = lexeme [ no_case [  as_parser(v) ] ];
            auto name     = lexeme [ +graph ];
            auto particle = lexeme [ no_case [  as_parser(p) ] ];
            return attr(type) >> confix(verb, particle) [ name ]; 
        };

        using Type = Command::Type;
        auto ask     = sentence(Type::ask,     "ask",     "to")   >> lexeme[+char_];
        auto notify  = sentence(Type::notify,  "notify",  "that") >> lexeme[+char_];
        auto message = sentence(Type::message, "message", "that") >> lexeme[+char_];

        auto command // = rule<struct command, Command> { }
                     = ask | notify | message;

        auto parser  = skip(space) [ command ];
    }
}

int main() {
    for (std::string const input : {
            "Ask peter to call me",
            "Notify Jenna that I am going to be away",
            "Message home that I am running late",
            })
    {
        commands::Command cmd;

        if (parse(input.begin(), input.end(), commands::grammar::parser, cmd))
            std::cout << "Matched: " << cmd.type << "|" << cmd.name << "|" << cmd.message << "\n";
        else
            std::cout << "No match in '" << input << "'\n";
    }

}

Prints:

Matched: 0|peter|call me
Matched: 2|Jenna|I am going to be away
Matched: 1|home|I am running late

¹ I'm no English linguist so I don't know whether that is the correct grammatical term :)

#2


2  

This code reads the command strings from the file "commands.txt", searches each line with the regular expression, and prints the parts whenever there is a match.

#include <cstdlib>   // exit
#include <iostream>
#include <fstream>
#include <string>
#include <boost/regex.hpp>

const int NumCmdParts = 4;
std::string CommandPartIds[] = {"Verb", "Name", "Preposition", "Content"};

int main(int argc, char *argv[])
{

    std::ifstream ifs;
    ifs.open ("commands.txt", std::ifstream::in);
    if (!ifs.is_open()) {
      std::cout << "Error opening file commands.txt" << std::endl;
      exit(1);
    }

    std::string cmdStr;

    // Pieces of regular expression pattern
    // '(?<Verb>' : This is to name the capture group as 'Verb'
    std::string VerbPat = "(?<Verb>(Ask)|(Notify|Message))";
    std::string SeparatorPat = "\\s*";  
    std::string NamePat = "(?<Name>\\w+)";

    // Conditional expression. if (Ask) (to) else (that)
    std::string PrepositionPat = "(?<Preposition>(?(2)(to)|(that)))";
    std::string ContentPat = "(?<Content>.*)";

    // Put the pieces together to compose pattern
    std::string TotalPat = VerbPat + SeparatorPat + NamePat + SeparatorPat
                            + PrepositionPat + SeparatorPat + ContentPat;

    boost::regex actions_re(TotalPat);
    boost::smatch action_match;

    while (getline(ifs, cmdStr)) {
        bool IsMatch = boost::regex_search(cmdStr, action_match, actions_re);
        if (IsMatch) {          
          for (int i=1; i <= NumCmdParts; i++) {     
            std::cout << CommandPartIds[i-1] << ": " << action_match[CommandPartIds[i-1]] << "\n";
          }
        }
    }   

    ifs.close();
}
