如何判断一个正则表达式是否与另一个正则表达式的子集匹配?

时间:2021-10-06 05:03:42

I'm just wondering if it's possible to use one regular expression to match another, that is some sort of:

我只是想知道是否有可能用一个正则表达式来匹配另一个正则表达式,那就是:

['a-z'].match(['b-x'])
True

['m-n'].match(['0-9'])
False

Is this sort of thing possible with regex at all? I'm doing work in python, so any advice specific to the re module's implementation would help, but I'll take anything I can get concerning regex.

使用regex是否可能实现这种情况?我正在使用python进行工作,因此任何针对re模块实现的建议都将有所帮助,但是我将接受关于regex的任何建议。

Edit: Ok, some clarification is obviously in order! I definitely know that normal matching syntax would look something like this:

编辑:好的,一些澄清显然是有序的!我知道正常的匹配语法应该是这样的:

expr = re.compile(r'[a-z]*')
string = "some words"
expr.match(string)
<sRE object blah blah>

but I'm wondering if regular expressions have the capability to match other, less specific expressions in the non-syntacticly correct version I tried to explain with above, any letter from b-x would always be a subset (match) of any letter from a-z. I know just from trying that this isn't something you can do by just calling the match of one compiled expression on another compiled expression, but the question remains: is this at all possible?

但是我想知道,如果正则表达式能够匹配我在上面试图解释的非语法正确版本中的其他不那么特定的表达式,那么来自b-x的任何字母总是a-z字母的子集(匹配)。通过尝试,我知道这不是一种可以通过调用一个编译表达式在另一个编译表达式上的匹配来实现的方法,但是问题仍然存在:这是完全可能的吗?

Let me know if this still isn't clear.

如果还不清楚,请告诉我。

7 个解决方案

#1


11  

I think — in theory — to tell whether regexp A matches a subset of what regexp B matches, an algorithm could:

我认为——在理论上——要判断regexp A是否匹配regexp B匹配的子集,算法可以:

  1. Compute the minimal Deterministic Finite Automaton of B and also of the "union" A|B.
  2. 计算B的最小确定性有限自动机和“联合”A|B的最小确定性有限自动机。
  3. Check if the two DFAs are identical. This is true if and only if A matches a subset of what B matches.
  4. 检查两个DFAs是否相同。当且仅当A匹配B的子集时,这是成立的。

However, it would likely be a major project to do this in practice. There are explanations such as Constructing a minimum-state DFA from a Regular Expression but they only tend to consider mathematically pure regexps. You would also have to handle the extensions that Python adds for convenience. Moreover, if any of the extensions cause the language to be non-regular (I am not sure if this is the case) you might not be able to handle those ones.

然而,在实践中这可能是一个主要的项目。有一些解释,比如从正则表达式构造一个最小状态DFA,但是它们只考虑数学上的纯regexp。您还必须处理Python为方便添加的扩展。此外,如果任何扩展导致语言是非规则的(我不确定是否如此),您可能无法处理这些扩展。

But what are you trying to do? Perhaps there's an easier approach...?

但你想做什么?也许有一个更简单的方法……

#2


6  

In addition to antinome's answer:

除了antinome的回答:

Many of the constructs that are not part of the basic regex definition are still regular, and can be converted after parsing the regex (with a real parser, because the language of regex is not regular itself): (x?) to (x|), (x+) to (xx*), character classes like [a-d] to their corresponding union (a|b|c|d) etc. So one can use these constructs and still test whether one regex matches a subset of the other regex using the DFA comparison mentioned by antinome.

许多结构不基本正则表达式定义的一部分仍然正常,并且可以转换后解析正则表达式(真正的解析器,因为正则表达式本身并不是普通的语言):(x ?)(x |)(x +)(xx *),字符类(模拟)等相应的联盟(c b | | | d)等。所以仍然可以使用这些构造和测试是否一个正则表达式匹配的一个子集其他正则表达式使用DFA antinome提到的比较。

Some constructs, like back references, are not regular, and cannot be represented by NFA or DFA.

有些构造,如后置引用,不是规则的,不能由NFA或DFA表示。

Even the seemingly simple problem of testing whether a regex with back references matches a particular string is NP-complete (http://perl.plover.com/NPC/NPC-3COL.html).

即使是一个看起来简单的问题,即是否使用返回引用的regex与特定的字符串相匹配,也就是NP-complete (http://perl.plover.com/NPC/NPC-3COL.html)。

#3


6  

Verification of the post by "antinome" using two regex : 55* and 5* :

REGEX_A: 55* [This matches "5", "55", "555" etc. and does NOT match "4" , "54" etc]

REGEX_A: 55*[此匹配“5”、“55”、“555”等,不匹配“4”、“54”等]

REGEX_B: 5* [This matches "", "5" "55", "555" etc. and does NOT match "4" , "54" etc]

REGEX_B: 5*[此匹配"、"5"、"55"、"555"等,且不匹配"4"、"54"等]

[Here we've assumed that 55* is not implicitly .55.* and 5* is not .5.* - This is why 5* does not match 4]

这里我们假设55*不是隐含的。55。* 5*不是。5。* -这就是为什么5*不匹配4]

REGEX_A can have an NFA as below:

  {A}--5-->{B}--epsilon-->{C}--5-->{D}--epsilon-->{E}
           {B} -----------------epsilon --------> {E} 
                          {C} <--- epsilon ------ {E}

REGEX_B can have an NFA as below:

  {A}--epsilon-->{B}--5-->{C}--epsilon-->{D}
  {A} --------------epsilon -----------> {D} 
                 {B} <--- epsilon ------ {D}

Now we can derive NFA * DFA of (REGEX_A|REGEX_B) as below:

  NFA:
  {state A}  ---epsilon --> {state B} ---5--> {state C} ---5--> {state D}
                                              {state C} ---epsilon --> {state D} 
                                              {state C} <---epsilon -- {state D}
  {state A}  ---epsilon --> {state E} ---5--> {state F}
                            {state E} ---epsilon --> {state F} 
                            {state E} <---epsilon -- {state F}

  NFA -> DFA:

       |   5          |  epsilon*
   ----+--------------+--------
    A  |  B,C,E,F,G   |   A,C,E,F
    B  |  C,D,E,F     |   B,C,E,F
    c  |  C,D,E,F     |   C
    D  |  C,D,E,F,G   |   C,D,E,F
    E  |  C,D,E,F,G   |   C,E,F
    F  |  C,E,F,G     |   F
    G  |  C,D,E,G     |   C,E,F,G

                    5(epsilon*)
    -------------+---------------------
              A  |  B,C,E,F,G 
      B,C,E,F,G  |  C,D,E,F,G 
      C,D,E,F,G  |  C,D,E,F,G 

    Finally the DFA for (REGEX_A|REGEX_B) is:
         {A}--5--->{B,C,E,F,G}--5--->{C,D,E,F,G}
                                     {C,D,E,F,G}---5--> {C,D,E,F,G}

         Note: {A} is start state and {C,D,E,F,G} is accepting state. 

Similarly DFA for REGEX_A (55*) is:

       |   5    |  epsilon*
   ----+--------+--------
    A  | B,C,E  |   A
    B  | C,D,E  |   B,C,E
    C  | C,D,E  |   C
    D  | C,D,E  |   C,D,E
    E  | C,D,E  |   C,E


            5(epsilon*)
   -------+---------------------
       A  |  B,C,E  
   B,C,E  |  C,D,E
   C,D,E  |  C,D,E

    {A} ---- 5 -----> {B,C,E}--5--->{C,D,E}
                                    {C,D,E}--5--->{C,D,E}
Note: {A} is start state and {C,D,E} is accepting state

Similarly DFA for REGEX_B (5*) is:

       |   5    |  epsilon*
   ----+--------+--------
    A  | B,C,D  |   A,B,D
    B  | B,C,D  |   B
    C  | B,C,D  |   B,C,D
    D  | B,C,D  |   B,D


            5(epsilon*)
   -------+---------------------
       A  |  B,C,D  
   B,C,D  |  B,C,D

    {A} ---- 5 -----> {B,C,D}
                      {B,C,D} --- 5 ---> {B,C,D}
Note: {A} is start state and {B,C,D} is accepting state

Conclusions:

DFA of REGX_A|REGX_B identical to DFA of REGX_A 
      -- implies REGEX_A is subset of REGEX_B
DFA of REGX_A|REGX_B is NOT identical to DFA of REGX_B 
      -- cannot infer about either gerexes.

#4


1  

You should do something along these lines:

你应该做一些类似的事情:

re.match("\[[b-x]-[b-x]\]", "[a-z]")

The regular expression has to define what the string should look like. If you want to match an opening square bracket followed by a letter from b to x, then a dash, then another letter from b to x and finally a closing square bracket, the solution above should work.

正则表达式必须定义字符串应该是什么样子。如果你想匹配一个左方括号,后面跟着一个从b到x的字母,然后是一个破折号,然后是一个从b到x的字母,最后是一个结束方括号,上面的解决方案应该是可行的。

If you intend to validate that a regular expression is correct you should consider testing if it compiles instead.

如果您想验证正则表达式是否正确,您应该考虑测试它是否编译。

#5


1  

It's possible with the string representation of a regex, since any string can be matched with regexes, but not with the compiled version returned by re.compile. I don't see what use this would be, though. Also, it takes a different syntax.

使用regex的字符串表示是可能的,因为任何字符串都可以与regexes匹配,但不能与re.compile返回的编译版本匹配。我不知道这有什么用。而且,它采用了不同的语法。

Edit: you seem to be looking for the ability to detect whether the language defined by an RE is a subset of another RE's. Yes, I think that's possible, but no, Python's re module doesn't do it.

编辑:您似乎在寻找一种能力来检测RE所定义的语言是否是另一个RE的子集。是的,我认为这是可能的,但是不,Python的re模块不这样做。

#6


0  

Some clarification is required, I think:

我认为需要做一些澄清:

.

rA = re.compile('(?<! )[a-z52]+')

'(?<! )[a-z52]+' is a pattern

”(? < !)(a-z52)+是一种模式

rA is an instance of class RegexObject whose type is < *type '_sre.SRE_Pattern' >* .

rA是RegexObject类的实例,其类型是< *type '_sre。SRE_Pattern > *。

Personally I use the term regex exclusively for this kind of objects, not for the pattern .

我个人使用regex这个术语专门用于这种对象,而不是模式。

Note that rB = re.compile(rA) is also possible, it produces the same object (id(rA) == id(rB) equals to True)

注意,rB = re.compile(rA)也是可能的,它生成相同的对象(id(rA) = id(rB) = True)


ch = 'lshdgbfcs luyt52uir bqisuytfqr454'
x = rA.match(ch) 
# or
y = rA.search(ch)

x and y are instances of the class MatchObject whose type is *< type '_sre.SRE_Match' >*

x和y是类型为*< type '_sre的MatchObject类的实例。SRE_Match ' > *


.

That said, you want to know if there a way to determine if a regex rA can match all the strings matched by another regex rB while rB matches only a subsest of all the strings matched by rA.

也就是说,您想知道是否有一种方法可以确定regex rA是否匹配由另一个regex rB匹配的所有字符串,而rB只匹配由rA匹配的所有字符串的重写本。

I don't think such a way exists, whatever theoretically or practically.

我不认为这样的方式存在,无论从理论上还是实际上。

#7


0  

pip3 install https://github.com/leafstorm/lexington/archive/master.zip
python3
>>> from lexington.regex import Regex as R
>>> from lexington.regex import Null
>>> from functools import reduce
>>> from string import ascii_lowercase, digits
>>> a_z = reduce(lambda a, b: a | R(b), ascii_lowercase, Null)
>>> b_x = reduce(lambda a, b: a | R(b), ascii_lowercase[1:-2], Null)
>>> a_z | b_x == a_z
True
>>> m_n = R("m") | R("n")
>>> zero_nine = reduce(lambda a, b: a | R(b), digits, Null)
>>> m_n | zero_nine == m_n
False

Also tested successfully with Python 2. See also how to do it with a different library.

还成功地测试了Python 2。请参见如何使用不同的库进行操作。

Alternatively, pip3 install https://github.com/ferno/greenery/archive/master.zip and:

或者,pip3安装https://github.com/ferno/greenery/archive/master.zip,并:

from greenery.lego import parse as p
a_z = p("[a-z]")
b_x = p("[b-x]")
assert a_z | b_x == a_z
m_n = p("m|n")
zero_nine = p("[0-9]")
assert not m_n | zero_nine == m_n

#1


11  

I think — in theory — to tell whether regexp A matches a subset of what regexp B matches, an algorithm could:

我认为——在理论上——要判断regexp A是否匹配regexp B匹配的子集,算法可以:

  1. Compute the minimal Deterministic Finite Automaton of B and also of the "union" A|B.
  2. 计算B的最小确定性有限自动机和“联合”A|B的最小确定性有限自动机。
  3. Check if the two DFAs are identical. This is true if and only if A matches a subset of what B matches.
  4. 检查两个DFAs是否相同。当且仅当A匹配B的子集时,这是成立的。

However, it would likely be a major project to do this in practice. There are explanations such as Constructing a minimum-state DFA from a Regular Expression but they only tend to consider mathematically pure regexps. You would also have to handle the extensions that Python adds for convenience. Moreover, if any of the extensions cause the language to be non-regular (I am not sure if this is the case) you might not be able to handle those ones.

然而,在实践中这可能是一个主要的项目。有一些解释,比如从正则表达式构造一个最小状态DFA,但是它们只考虑数学上的纯regexp。您还必须处理Python为方便添加的扩展。此外,如果任何扩展导致语言是非规则的(我不确定是否如此),您可能无法处理这些扩展。

But what are you trying to do? Perhaps there's an easier approach...?

但你想做什么?也许有一个更简单的方法……

#2


6  

In addition to antinome's answer:

除了antinome的回答:

Many of the constructs that are not part of the basic regex definition are still regular, and can be converted after parsing the regex (with a real parser, because the language of regex is not regular itself): (x?) to (x|), (x+) to (xx*), character classes like [a-d] to their corresponding union (a|b|c|d) etc. So one can use these constructs and still test whether one regex matches a subset of the other regex using the DFA comparison mentioned by antinome.

许多结构不基本正则表达式定义的一部分仍然正常,并且可以转换后解析正则表达式(真正的解析器,因为正则表达式本身并不是普通的语言):(x ?)(x |)(x +)(xx *),字符类(模拟)等相应的联盟(c b | | | d)等。所以仍然可以使用这些构造和测试是否一个正则表达式匹配的一个子集其他正则表达式使用DFA antinome提到的比较。

Some constructs, like back references, are not regular, and cannot be represented by NFA or DFA.

有些构造,如后置引用,不是规则的,不能由NFA或DFA表示。

Even the seemingly simple problem of testing whether a regex with back references matches a particular string is NP-complete (http://perl.plover.com/NPC/NPC-3COL.html).

即使是一个看起来简单的问题,即是否使用返回引用的regex与特定的字符串相匹配,也就是NP-complete (http://perl.plover.com/NPC/NPC-3COL.html)。

#3


6  

Verification of the post by "antinome" using two regex : 55* and 5* :

REGEX_A: 55* [This matches "5", "55", "555" etc. and does NOT match "4" , "54" etc]

REGEX_A: 55*[此匹配“5”、“55”、“555”等,不匹配“4”、“54”等]

REGEX_B: 5* [This matches "", "5" "55", "555" etc. and does NOT match "4" , "54" etc]

REGEX_B: 5*[此匹配"、"5"、"55"、"555"等,且不匹配"4"、"54"等]

[Here we've assumed that 55* is not implicitly .55.* and 5* is not .5.* - This is why 5* does not match 4]

这里我们假设55*不是隐含的。55。* 5*不是。5。* -这就是为什么5*不匹配4]

REGEX_A can have an NFA as below:

  {A}--5-->{B}--epsilon-->{C}--5-->{D}--epsilon-->{E}
           {B} -----------------epsilon --------> {E} 
                          {C} <--- epsilon ------ {E}

REGEX_B can have an NFA as below:

  {A}--epsilon-->{B}--5-->{C}--epsilon-->{D}
  {A} --------------epsilon -----------> {D} 
                 {B} <--- epsilon ------ {D}

Now we can derive NFA * DFA of (REGEX_A|REGEX_B) as below:

  NFA:
  {state A}  ---epsilon --> {state B} ---5--> {state C} ---5--> {state D}
                                              {state C} ---epsilon --> {state D} 
                                              {state C} <---epsilon -- {state D}
  {state A}  ---epsilon --> {state E} ---5--> {state F}
                            {state E} ---epsilon --> {state F} 
                            {state E} <---epsilon -- {state F}

  NFA -> DFA:

       |   5          |  epsilon*
   ----+--------------+--------
    A  |  B,C,E,F,G   |   A,C,E,F
    B  |  C,D,E,F     |   B,C,E,F
    c  |  C,D,E,F     |   C
    D  |  C,D,E,F,G   |   C,D,E,F
    E  |  C,D,E,F,G   |   C,E,F
    F  |  C,E,F,G     |   F
    G  |  C,D,E,G     |   C,E,F,G

                    5(epsilon*)
    -------------+---------------------
              A  |  B,C,E,F,G 
      B,C,E,F,G  |  C,D,E,F,G 
      C,D,E,F,G  |  C,D,E,F,G 

    Finally the DFA for (REGEX_A|REGEX_B) is:
         {A}--5--->{B,C,E,F,G}--5--->{C,D,E,F,G}
                                     {C,D,E,F,G}---5--> {C,D,E,F,G}

         Note: {A} is start state and {C,D,E,F,G} is accepting state. 

Similarly DFA for REGEX_A (55*) is:

       |   5    |  epsilon*
   ----+--------+--------
    A  | B,C,E  |   A
    B  | C,D,E  |   B,C,E
    C  | C,D,E  |   C
    D  | C,D,E  |   C,D,E
    E  | C,D,E  |   C,E


            5(epsilon*)
   -------+---------------------
       A  |  B,C,E  
   B,C,E  |  C,D,E
   C,D,E  |  C,D,E

    {A} ---- 5 -----> {B,C,E}--5--->{C,D,E}
                                    {C,D,E}--5--->{C,D,E}
Note: {A} is start state and {C,D,E} is accepting state

Similarly DFA for REGEX_B (5*) is:

       |   5    |  epsilon*
   ----+--------+--------
    A  | B,C,D  |   A,B,D
    B  | B,C,D  |   B
    C  | B,C,D  |   B,C,D
    D  | B,C,D  |   B,D


            5(epsilon*)
   -------+---------------------
       A  |  B,C,D  
   B,C,D  |  B,C,D

    {A} ---- 5 -----> {B,C,D}
                      {B,C,D} --- 5 ---> {B,C,D}
Note: {A} is start state and {B,C,D} is accepting state

Conclusions:

DFA of REGX_A|REGX_B identical to DFA of REGX_A 
      -- implies REGEX_A is subset of REGEX_B
DFA of REGX_A|REGX_B is NOT identical to DFA of REGX_B 
      -- cannot infer about either gerexes.

#4


1  

You should do something along these lines:

你应该做一些类似的事情:

re.match("\[[b-x]-[b-x]\]", "[a-z]")

The regular expression has to define what the string should look like. If you want to match an opening square bracket followed by a letter from b to x, then a dash, then another letter from b to x and finally a closing square bracket, the solution above should work.

正则表达式必须定义字符串应该是什么样子。如果你想匹配一个左方括号,后面跟着一个从b到x的字母,然后是一个破折号,然后是一个从b到x的字母,最后是一个结束方括号,上面的解决方案应该是可行的。

If you intend to validate that a regular expression is correct you should consider testing if it compiles instead.

如果您想验证正则表达式是否正确,您应该考虑测试它是否编译。

#5


1  

It's possible with the string representation of a regex, since any string can be matched with regexes, but not with the compiled version returned by re.compile. I don't see what use this would be, though. Also, it takes a different syntax.

使用regex的字符串表示是可能的,因为任何字符串都可以与regexes匹配,但不能与re.compile返回的编译版本匹配。我不知道这有什么用。而且,它采用了不同的语法。

Edit: you seem to be looking for the ability to detect whether the language defined by an RE is a subset of another RE's. Yes, I think that's possible, but no, Python's re module doesn't do it.

编辑:您似乎在寻找一种能力来检测RE所定义的语言是否是另一个RE的子集。是的,我认为这是可能的,但是不,Python的re模块不这样做。

#6


0  

Some clarification is required, I think:

我认为需要做一些澄清:

.

rA = re.compile('(?<! )[a-z52]+')

'(?<! )[a-z52]+' is a pattern

”(? < !)(a-z52)+是一种模式

rA is an instance of class RegexObject whose type is < *type '_sre.SRE_Pattern' >* .

rA是RegexObject类的实例,其类型是< *type '_sre。SRE_Pattern > *。

Personally I use the term regex exclusively for this kind of objects, not for the pattern .

我个人使用regex这个术语专门用于这种对象,而不是模式。

Note that rB = re.compile(rA) is also possible, it produces the same object (id(rA) == id(rB) equals to True)

注意,rB = re.compile(rA)也是可能的,它生成相同的对象(id(rA) = id(rB) = True)


ch = 'lshdgbfcs luyt52uir bqisuytfqr454'
x = rA.match(ch) 
# or
y = rA.search(ch)

x and y are instances of the class MatchObject whose type is *< type '_sre.SRE_Match' >*

x和y是类型为*< type '_sre的MatchObject类的实例。SRE_Match ' > *


.

That said, you want to know if there a way to determine if a regex rA can match all the strings matched by another regex rB while rB matches only a subsest of all the strings matched by rA.

也就是说,您想知道是否有一种方法可以确定regex rA是否匹配由另一个regex rB匹配的所有字符串,而rB只匹配由rA匹配的所有字符串的重写本。

I don't think such a way exists, whatever theoretically or practically.

我不认为这样的方式存在,无论从理论上还是实际上。

#7


0  

pip3 install https://github.com/leafstorm/lexington/archive/master.zip
python3
>>> from lexington.regex import Regex as R
>>> from lexington.regex import Null
>>> from functools import reduce
>>> from string import ascii_lowercase, digits
>>> a_z = reduce(lambda a, b: a | R(b), ascii_lowercase, Null)
>>> b_x = reduce(lambda a, b: a | R(b), ascii_lowercase[1:-2], Null)
>>> a_z | b_x == a_z
True
>>> m_n = R("m") | R("n")
>>> zero_nine = reduce(lambda a, b: a | R(b), digits, Null)
>>> m_n | zero_nine == m_n
False

Also tested successfully with Python 2. See also how to do it with a different library.

还成功地测试了Python 2。请参见如何使用不同的库进行操作。

Alternatively, pip3 install https://github.com/ferno/greenery/archive/master.zip and:

或者,pip3安装https://github.com/ferno/greenery/archive/master.zip,并:

from greenery.lego import parse as p
a_z = p("[a-z]")
b_x = p("[b-x]")
assert a_z | b_x == a_z
m_n = p("m|n")
zero_nine = p("[0-9]")
assert not m_n | zero_nine == m_n