Python库可以解析regex到AST吗?

时间:2022-06-18 14:51:00

To emphasize, I do not want to "parse using a regex" - I want to "parse a regex into a symbolic tree." (Searching has only brought up the former...)

要强调的是,我不想“使用regex进行解析”——我想“将regex解析为一个符号树”。(搜索只带来了前者……)

My use case: to speed up a regex search over a database, I'd like to parse a regex like (foo|bar)baz+(bat)* and pull out all substrings that MUST appear in a match. (In this case, it's just baz because foo/bar are alternations and bat can appear 0 times.)

我的用例:为了加快对数据库的regex搜索,我希望解析一个regex,比如(foo|bar)baz+(bat)*,并取出匹配中必须出现的所有子字符串。(在这种情况下,它只是baz,因为foo/bar是交替的,bat可以出现0次。)

To do this, I need some understanding of regex operators/semantics. re.DEBUG comes closest:

为此,我需要了解regex操作符/语义。re.DEBUG亲密:

In [7]: re.compile('(foo|bar)baz+(bat)', re.DEBUG)
subpattern 1
  branch
    literal 102
    literal 111
    literal 111
  or
    literal 98
    literal 97
    literal 114
literal 98
literal 97
max_repeat 1 4294967295
  literal 122
subpattern 2
  literal 98
  literal 97
  literal 116

However, it's just printing out, and the c-implementation doesn't preserve the structure afterwards as far as I can tell. Any ideas on how I can parse this out without writing my owner parser?

但是,它只是打印出来,而c实现并没有保留结构。关于如何在不编写所有者解析器的情况下解析它,您有什么想法吗?

1 个解决方案

#1


2  

You can only specify a (classic) regex using a context free grammar:

您只能使用上下文*语法指定(classic) regex:

 regex = { alternatives };
 alternatives =  primitive { '|' alternatives } ;
 primitive = '(' regex ')' | '[' character_set ']' | ...

This means you can't parse a regex using a regex (Perl is an exception, but then its "regexes" are extended way beyond "classic").

这意味着您不能使用regex解析regex (Perl是一个例外,但是它的“regex”被扩展到“classic”之外)。

So, to parse a regex, you'll need to build your own parser and constructs some kind of tree (re.Debug comes pretty close) or that magic library you are hoping for.

因此,要解析regex,您需要构建自己的解析器并构造某种树(re.Debug非常接近)或您希望的魔法库。

I suspect this is the easy part. This isn't terribly hard to do yourself; see Is there an alternative for flex/bison that is usable on 8-bit embedded systems? for a straightforward scheme for building such parsers.

我怀疑这是最简单的部分。这对你自己来说并不难;在8位嵌入式系统中是否有可用于flex/bison的替代方案?对于构建此类解析器的简单方案。

To understand the semantics of the regex (e.g., to figure out "necessary substrings"), you might be able to get away with building an analyzer the walks over the parse tree, and for each subtree (bottom up), computes the common-string. Failing that you may have to implement the classic NDFA construction and then walk over it, or implement the NDFA to DFA construction and walk over the DFA. Real regexes tend to contain lots of messy complications such as built-in character sets, capture groups, etc.

为了理解regex的语义(例如,为了找出“必要的子字符串”),您可以构建一个分析器,在解析树上遍历,并对每个子树(自下而上)计算公共字符串。否则,您可能需要执行经典的NDFA结构,然后遍历它,或者执行NDFA到DFA结构,然后遍历DFA。真正的正则表达式往往包含许多复杂的问题,比如内置的字符集、捕获组等等。

The "common string" might not be just a contiguous sequence of characters although you could define it narrowly as such. It might include several constant substrings separated by fixed or variable length gaps of characters, e.g., your necessary substring might always itself be expressible as a "simple regex" of the form:

“公共字符串”可能不只是一个连续的字符序列,尽管您可以将其定义为这样。它可能包含几个常量子字符串,由固定的或可变的字符长度间隔分隔,例如,您所需要的子字符串本身可能总是可以表示为表单的“简单regex”:

   (<character>+ ?+) <character>+

#1


2  

You can only specify a (classic) regex using a context free grammar:

您只能使用上下文*语法指定(classic) regex:

 regex = { alternatives };
 alternatives =  primitive { '|' alternatives } ;
 primitive = '(' regex ')' | '[' character_set ']' | ...

This means you can't parse a regex using a regex (Perl is an exception, but then its "regexes" are extended way beyond "classic").

这意味着您不能使用regex解析regex (Perl是一个例外,但是它的“regex”被扩展到“classic”之外)。

So, to parse a regex, you'll need to build your own parser and constructs some kind of tree (re.Debug comes pretty close) or that magic library you are hoping for.

因此,要解析regex,您需要构建自己的解析器并构造某种树(re.Debug非常接近)或您希望的魔法库。

I suspect this is the easy part. This isn't terribly hard to do yourself; see Is there an alternative for flex/bison that is usable on 8-bit embedded systems? for a straightforward scheme for building such parsers.

我怀疑这是最简单的部分。这对你自己来说并不难;在8位嵌入式系统中是否有可用于flex/bison的替代方案?对于构建此类解析器的简单方案。

To understand the semantics of the regex (e.g., to figure out "necessary substrings"), you might be able to get away with building an analyzer the walks over the parse tree, and for each subtree (bottom up), computes the common-string. Failing that you may have to implement the classic NDFA construction and then walk over it, or implement the NDFA to DFA construction and walk over the DFA. Real regexes tend to contain lots of messy complications such as built-in character sets, capture groups, etc.

为了理解regex的语义(例如,为了找出“必要的子字符串”),您可以构建一个分析器,在解析树上遍历,并对每个子树(自下而上)计算公共字符串。否则,您可能需要执行经典的NDFA结构,然后遍历它,或者执行NDFA到DFA结构,然后遍历DFA。真正的正则表达式往往包含许多复杂的问题,比如内置的字符集、捕获组等等。

The "common string" might not be just a contiguous sequence of characters although you could define it narrowly as such. It might include several constant substrings separated by fixed or variable length gaps of characters, e.g., your necessary substring might always itself be expressible as a "simple regex" of the form:

“公共字符串”可能不只是一个连续的字符序列,尽管您可以将其定义为这样。它可能包含几个常量子字符串,由固定的或可变的字符长度间隔分隔,例如,您所需要的子字符串本身可能总是可以表示为表单的“简单regex”:

   (<character>+ ?+) <character>+