I'd like an explanation of the behavior of Perl's regular expression engine

Time: 2023-01-20 19:19:37

Update by @Borodin

I've rewritten this code as something I think is more comprehensible. The OP was comparing b with d and suchlike, and I've changed all the symbols to more distinct ASCII characters. The result is equivalent to that of the OP's original code.

I've briefly checked all of the regex patterns by hand, and I don't see a discrepancy.

#! /usr/local/bin/perl

use strict;
use warnings qw/ all FATAL /;

use List::Util 'max';

my @tests = (
    [ vvOHvXcvv => qr/ ^ ( (v*) O    | H? (v*) X )* c \2 $ /x ],
    [ vvOvXcvv  => qr/ ^ ( (v*) O    | H? (v*) X )* c \2 $ /x ],
    [ vvXHvXcvv => qr/ ^ ( (v*) X    | H? (v*) X )* c \2 $ /x ],
    [ vvXvXcvv  => qr/ ^ ( (v*) X    | H? (v*) X )* c \2 $ /x ],
    [ vvOHvXcvv => qr/ ^ ( (v*) [XO] | H? (v*) X )* c \2 $ /x ],
    [ vvOvXcvv  => qr/ ^ ( (v*) [XO] | H? (v*) X )* c \2 $ /x ],
    [ vvXHvXcvv => qr/ ^ ( (v*) [XO] | H? (v*) X )* c \2 $ /x ],
    [ vvXvXcvv  => qr/ ^ ( (v*) [XO] | H? (v*) X )* c \2 $ /x ],
);

my $w1 = max map length $_->[0], @tests;
my ($yes, $no) = ( 'MATCHES', "doesn't match" );
my $w2 = max map length, $yes, $no;

for my $test ( @tests ) {
    my ( $str, $re ) = @$test;

    printf "%-*s %-*s %s\n",
            $w1+2, qq{"$str"},
            $w2, $str =~ $re ? $yes : $no,
            $re;
}

output

"vvOHvXcvv" MATCHES       (?^x: ^ ( (v*) O    | H? (v*) X )* c \2 $ )
"vvOvXcvv"  MATCHES       (?^x: ^ ( (v*) O    | H? (v*) X )* c \2 $ )
"vvXHvXcvv" MATCHES       (?^x: ^ ( (v*) X    | H? (v*) X )* c \2 $ )
"vvXvXcvv"  doesn't match (?^x: ^ ( (v*) X    | H? (v*) X )* c \2 $ )
"vvOHvXcvv" doesn't match (?^x: ^ ( (v*) [XO] | H? (v*) X )* c \2 $ )
"vvOvXcvv"  doesn't match (?^x: ^ ( (v*) [XO] | H? (v*) X )* c \2 $ )
"vvXHvXcvv" doesn't match (?^x: ^ ( (v*) [XO] | H? (v*) X )* c \2 $ )
"vvXvXcvv"  doesn't match (?^x: ^ ( (v*) [XO] | H? (v*) X )* c \2 $ )



The following Perl program tests a few strings against various regex patterns that use back-references. It illustrates a behaviour that I cannot understand.

The $snum and $rnum variables are used only to number the strings and patterns in the output for easier reading. The only thing worth reading is the contents of the @test array.

#! /usr/local/bin/perl -w

use strict;
use warnings;

my @test = (
    [ "aadeabcaa", qr/^((a*)d|e?(a*)b)*c\2$/ ],
    [ "aadabcaa", qr/^((a*)d|e?(a*)b)*c\2$/ ],
    [ "aabeabcaa", qr/^((a*)b|e?(a*)b)*c\2$/ ],
    [ "aababcaa", qr/^((a*)b|e?(a*)b)*c\2$/ ],
    [ "aadeabcaa", qr/^((a*)[bd]|e?(a*)b)*c\2$/ ],
    [ "aadabcaa", qr/^((a*)[bd]|e?(a*)b)*c\2$/ ],
    [ "aabeabcaa", qr/^((a*)[bd]|e?(a*)b)*c\2$/ ],
    [ "aababcaa", qr/^((a*)[bd]|e?(a*)b)*c\2$/ ],
);

my %snum;
my %rnum;
my $lsnum;
my $lrnum;

for ( my $i = 0 ; $i < scalar(@test); $i++ ) {

    my $t = $test[$i];  my $s = $t->[0];  my $r = $t->[1];

    my $snum = ($snum{$s} //= $lsnum++);
    my $rnum = ($rnum{$r} //= $lrnum++);

    my $match = ($s =~ $r);

    print "test $i: (S$snum) $s" .
        ($match?" MATCHES ":" DOES NOT match ") .
        "(R$rnum) $r\n";
}

output

test 0: (S0) aadeabcaa MATCHES (R0) (?^:^((a*)d|e?(a*)b)*c\2$)
test 1: (S1) aadabcaa MATCHES (R0) (?^:^((a*)d|e?(a*)b)*c\2$)
test 2: (S2) aabeabcaa MATCHES (R1) (?^:^((a*)b|e?(a*)b)*c\2$)
test 3: (S3) aababcaa DOES NOT match (R1) (?^:^((a*)b|e?(a*)b)*c\2$)
test 4: (S0) aadeabcaa DOES NOT match (R2) (?^:^((a*)[bd]|e?(a*)b)*c\2$)
test 5: (S1) aadabcaa DOES NOT match (R2) (?^:^((a*)[bd]|e?(a*)b)*c\2$)
test 6: (S2) aabeabcaa DOES NOT match (R2) (?^:^((a*)[bd]|e?(a*)b)*c\2$)
test 7: (S3) aababcaa DOES NOT match (R2) (?^:^((a*)[bd]|e?(a*)b)*c\2$)

Note that egrep (or at any rate, GNU egrep) thinks that every test above is a match.

I think that is the theoretically "correct" answer if regexp disjunction is interpreted as a non-deterministic choice, in the sense that there exists a choice of alternatives that will make the match succeed.

Also note that (S2, S3, R1) are obtained by replacing d with b everywhere in (S0, S1, R0), which is another reason to think that the fourth test should be a match.

Intuitively, I would also like tests 4–7 to be matches insofar as tests 0–3 are.

I can sort of understand how one would arrive at the fourth test not matching. Suppose the engine tries the left branch and then the right branch, in that order, at each disjunction. If backtracking does not correctly restore \2 to its prior value, then exploring the left branch of the R1 disjunction on the latter ab substring of S3 would clobber \2 to a, and it would then not be restored to aa, causing the match to fail (whereas the same thing would not happen in any of the previous tests).

But I have no idea whether my analysis is correct. Why the fifth test doesn't match really escapes me.
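
One way to look at what the engine actually does on such a case is the debug mode of the re pragma, which ships with perl. The following is only a sketch for inspecting the trace; it makes no claim about what the trace will show, and the result may depend on the perl version in use.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# The debug mode of the `re` pragma prints the compiled regex program
# and a step-by-step execution trace to STDERR.
use re 'debug';

my $str = 'aababcaa';                   # S3 from the question
my $re  = qr/^((a*)b|e?(a*)b)*c\2$/;    # R1 from the question

# No expected verdict is asserted here; the trace on STDERR is the point.
print $str =~ $re ? "MATCHES\n" : "DOES NOT match\n";
```

Redirecting STDERR to a file (for example with 2> trace.txt, where trace.txt is an arbitrary name) makes the trace easier to read alongside the pattern.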

So anyway, my question is a combination of the following:

  • Can someone explain Perl's regexp engine behavior on those examples in detail?

  • Is this behavior intentional? Is it documented somewhere?

  • Should I file a bug?

3 answers

#1


2  

There is an even simpler example of the difference between egrep and Perl:

grep -iE '^(([ab])|([ab]))*\2$' <<< abA
abA
perl -wE 'say for shift =~ /^(([ab])|([ab]))*\2$/i' abA

Interestingly, the following matches in Perl (and egrep, too):

grep -iE '^(([ab])|([ab]))*(\3)$' <<< abA
abA
perl -wE 'say for shift =~ /^(([ab])|([ab]))*(\3)$/i' abA
b
b
a
A

So, the first a is matched by the first iteration of *, b is matched by the second one (because \1 eq 'b'). At the same time, \3 eq 'a', but \4 eq 'A'. Why is \3 eq 'a'? It seems to be a result of the previous iteration of the *, which I'd say is a bug.
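
To make the list printed by the one-liners easier to read, the captures can be labelled by group number. This is a minimal sketch; describe_captures is a hypothetical helper name, and no output is shown for the two patterns above because what they capture is exactly the behaviour under discussion and may differ between perl versions.

```perl
use strict;
use warnings;

# Hypothetical helper: match $str against $re and describe every capture
# group by number, showing "undef" for groups that did not participate.
sub describe_captures {
    my ($str, $re) = @_;
    # In list context a successful match returns ($1, $2, ...).
    my @caps = $str =~ $re or return "no match";
    return join ', ',
        map { '\\' . ($_ + 1) . '=' . ( defined $caps[$_] ? "'$caps[$_]'" : 'undef' ) }
        0 .. $#caps;
}

# The two patterns from this answer, applied to the same input.
print describe_captures('abA', qr/^(([ab])|([ab]))*\2$/i),   "\n";
print describe_captures('abA', qr/^(([ab])|([ab]))*(\3)$/i), "\n";
```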

Update: I've reported a bug.

#2


1  

Let's have a go at the fourth example. (Please don't number them from zero! I people, not computer!)

vvXvXcvv

doesn't match

qr/ ^ (
    (v*) X
    |
    H? (v*) X
)* c \2 $ /x
  • At the beginning of the string, perl matches the first of the two alternatives. vvX matches (v*) X so there is no need to try the alternative. That also saves capture 2 as vv

    That leaves vXcvv for the engine to match

  • Again, perl uses vX to match (v*) X. It saves capture 2 as v and the engine goes around for another try

    That leaves cvv

  • The only options left are another iteration of ( (v*) X | H? (v*) X )*, or falling out of that loop into c \2

  • The text doesn't start with v, X, or H so the loop ends, and the next match is c \2, and the regex engine matches the c

    Now there is only vv to match

  • perl is now looking for a match to capture 2, which is v. That succeeds

    The remaining string is just v

  • Now perl is looking for $, which is the end of a string, or just before a newline at the end of a string. It sees v and so it fails
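
The key step in this walk-through, that a capture group inside a repeated group ends up holding what its most recent iteration matched, can be checked in isolation. The sketch below deliberately drops the alternation and uses a single capture group, so it illustrates only the overwrite, not the full pattern from the question; last_capture and matches_with_backref are hypothetical helper names.

```perl
use strict;
use warnings;

# Simplified pattern: one capture group inside a repeated group,
# followed by c and a backreference to that group.
my $re = qr/ ^ (?: (v*) X )* c \1 $ /x;

# Hypothetical helper: return what capture 1 holds after the loop, or undef.
sub last_capture {
    my ($str) = @_;
    return $str =~ / ^ (?: (v*) X )* c /x ? $1 : undef;
}

# Hypothetical helper: 1 if $str matches the backreference pattern, else 0.
sub matches_with_backref {
    my ($str) = @_;
    return $str =~ $re ? 1 : 0;
}

# "vvX" sets the group to "vv", then "vX" overwrites it with "v".
print last_capture('vvXvXc'), "\n";              # prints "v"

# So the backreference has to match a single "v".
print matches_with_backref('vvXvXcv'),  "\n";    # prints 1
print matches_with_backref('vvXvXcvv'), "\n";    # prints 0
```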

I really hope that helps. I'm not in a hurry to explain the remaining four examples and I can't yet see why there is confusion

I haven't experimented with egrep, and I'm surprised that it behaves differently. Maybe it doesn't stack the captures like Perl does?

Please let me know if there's anything of further interest

#3


0  

Here is how I understand the behaviour:

test 3: (S3) aababcaa DOES NOT match (R1) (?^:^((a*)b|e?(a*)b)*c\2$)

The first part of the alternative fails here, then we use the second part.

Group 2 contains a, so after substituting the backreference the regex is the same as:

 ^(e?(a*)b)*ca$

That doesn't match the string aababcaa that has aa at the end.

The match is ok if you have a double aa in the middle: aabaabcaa
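
These claims can be checked with a short script. This is a sketch only; verdict is a hypothetical helper name, and the last line prints whatever the installed perl decides rather than asserting a result, since that pattern uses the backreference behaviour in question.

```perl
use strict;
use warnings;

my $substituted = qr/^(e?(a*)b)*ca$/;           # the pattern with "a" in place of \2
my $original    = qr/^((a*)b|e?(a*)b)*c\2$/;    # R1 from the question

# Hypothetical helper: turn a match result into a readable verdict.
sub verdict {
    my ($str, $re) = @_;
    return $str =~ $re ? 'MATCHES' : 'DOES NOT match';
}

# The substituted pattern has no backreference, so these two are clear-cut.
print "aababcaa  vs substituted: ", verdict('aababcaa', $substituted), "\n";  # DOES NOT match
print "aababca   vs substituted: ", verdict('aababca',  $substituted), "\n";  # MATCHES

# The string with aa in the middle, against the original pattern.
print "aabaabcaa vs original:    ", verdict('aabaabcaa', $original), "\n";
```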
