* +量词究竟做了什么?

时间:2021-11-06 12:55:56

The idea of lazy and greedy are easy to understand, but I have only seen/used *+ once in my regex (in Java) [A]|[^B]*+(?!C) (A,B,C are arbitrary values) simply because it worked when the lazy modifier resulted in a * error.

懒惰和贪婪的想法很容易理解,但我只在我的正则表达式中使用* +一次(在Java中)[A] | [^ B] * +(?!C)(A,B,C是任意值)只是因为它在惰性修饰符导致*错误时起作用。

Because of most search engines' inability to search symbols, I can't find any documentation on this. So what exactly does *+ do and how does it do it?

由于大多数搜索引擎无法搜索符号,我找不到任何关于此的文档。那么* +究竟做了什么以及如何做到这一点?

3 个解决方案

#1


7  

A greedy quantifier matches everything it can and then the pattern backtracks until the match succeeds.

一个贪婪的量词匹配它所能做的一切,然后模式回溯直到匹配成功。

A lazy quantifier forward tracks until the match succeeds.

延迟量词前向跟踪直到匹配成功。

A possessive quantifier matches everything it can and never backtracks.

占有量词可以匹配它所能做的一切,而且永远不会回溯。

The + denotes a possessive quantifier. If can be used as, for example, ++ or *+.

+表示占有量词。如果可以用作例如++或* +。

This ability to prevent backtracking means it can stop catastrophic backtracking.

这种防止回溯的能力意味着它可以阻止灾难性的回溯。

#2


2  

As the other answers point out, *+ is a "possessive quantifier" It matches the previous element as many times as possible, just like a greedy quantifier, but never backtracks.

正如其他答案所指出的那样,* +是一个“占有量词”它尽可能多地匹配前一个元素,就像一个贪婪的量词,但从不回溯。

Why is this useful? Only as a performance optimization. Further, only as a performance optimization when the regex doesn't match. This is an important point to understand about regular expressions: their worst-case performance always occurs when they don't match.

为什么这有用?仅作为性能优化。此外,仅作为正则表达式不匹配时的性能优化。这是理解正则表达式的重要一点:它们的最坏情况性能总是在它们不匹配时发生。

Depending on the regex engine in use, and the details of the regex itself, worst-case performance can sometimes be stunningly bad. For a simple example, take this regex: a*a*a*b, matched against this string: aaaaac.

根据正在使用的正则表达式引擎和正则表达式本身的细节,最坏情况下的性能有时会非常糟糕。举一个简单的例子,取这个正则表达式:a * a * a * b,与这个字符串匹配:aaaaac。

Confronted with this situation, a standard "NFA" type regex engine will do something like this:

面对这种情况,标准的“NFA”型正则表达式引擎将执行以下操作:

  1. Try matching the 1st a 5 times, the 2nd a zero times, and the 3rd a zero times.
  2. 尝试匹配第一个a 5次,第二个匹配零次,第三次匹配零次。

  3. Try matching the b against the c -- it fails.
  4. 尝试将b与c匹配 - 它失败了。

  5. "Backtrack" and match the 1st a 4 times, the 2nd 1 time, and the 3rd zero times.
  6. “回溯”并匹配第1次4次,第2次1次,第3次零次。

  7. Try matching the b again -- it fails.
  8. 尝试再次匹配b - 它失败了。

  9. Backtrack again, try the 1st a 4 times, the 2nd zero times, and the 3rd 1 time.
  10. 再次回溯,尝试第1次4次,第2次零次,第3次1次。

  11. ...
  12. Backtrack, try the 1st a 3 times, the 2nd 2 times, and the 3rd zero times...
  13. 回溯,尝试第1次3次,第2次2次,第3次零次......

(I think you can fill out the next few hundred steps by yourself.)

(我想你可以自己填写接下来的几百个步骤。)

If the regex was a*+a*a*b, that would never happen. It would be more like:

如果正则表达式是* + a * a * b,则永远不会发生。这更像是:

  1. Try matching the 1st a 5 times, the 2nd zero times, and the 3rd zero times.
  2. 尝试匹配第1次5次,第2次零次,第3次零次。

  3. Try matching the b -- it fails.
  4. 尝试匹配b - 它失败了。

  5. Since the 1st a is "possessive", once it has matched 5 times, it can never try matching a smaller number of times. And since there are no as left in the string for the other a*s to match, they can only match zero times. There is nothing left to try, so the match as a whole fails.
  6. 由于第一个a是“占有”,一旦它匹配了5次,它就永远不会尝试匹配较少的次数。并且由于字符串中没有剩余的a * s匹配,因此它们只能匹配零次。没有什么可以尝试的,所以整场比赛都失败了。

  7. There is no step 4. You are done. While your friends are waiting for their regexps to finish executing, you can kick back your heels and crack open your cold beverage of choice.
  8. 没有第4步。你已经完成了。当你的朋友们正在等待他们的regexps完成执行时,你可以放松脚跟并打开你选择的冷饮。

#3


1  

X*+ means X, zero or more times (Possessive)

X * +表示X,零次或多次(占有)

The possessive quantifiers always eat the entire input string, trying once (and only once) for a match. Unlike the greedy quantifiers, possessive quantifiers never back off, even if doing so would allow the overall match to succeed.

占有量词总是占用整个输入字符串,尝试一次(并且只有一次)匹配。与贪婪的量词不同,占有量词永远不会退缩,即使这样做也会使整体匹配成功。

#1


7  

A greedy quantifier matches everything it can and then the pattern backtracks until the match succeeds.

一个贪婪的量词匹配它所能做的一切,然后模式回溯直到匹配成功。

A lazy quantifier forward tracks until the match succeeds.

延迟量词前向跟踪直到匹配成功。

A possessive quantifier matches everything it can and never backtracks.

占有量词可以匹配它所能做的一切,而且永远不会回溯。

The + denotes a possessive quantifier. If can be used as, for example, ++ or *+.

+表示占有量词。如果可以用作例如++或* +。

This ability to prevent backtracking means it can stop catastrophic backtracking.

这种防止回溯的能力意味着它可以阻止灾难性的回溯。

#2


2  

As the other answers point out, *+ is a "possessive quantifier" It matches the previous element as many times as possible, just like a greedy quantifier, but never backtracks.

正如其他答案所指出的那样,* +是一个“占有量词”它尽可能多地匹配前一个元素,就像一个贪婪的量词,但从不回溯。

Why is this useful? Only as a performance optimization. Further, only as a performance optimization when the regex doesn't match. This is an important point to understand about regular expressions: their worst-case performance always occurs when they don't match.

为什么这有用?仅作为性能优化。此外,仅作为正则表达式不匹配时的性能优化。这是理解正则表达式的重要一点:它们的最坏情况性能总是在它们不匹配时发生。

Depending on the regex engine in use, and the details of the regex itself, worst-case performance can sometimes be stunningly bad. For a simple example, take this regex: a*a*a*b, matched against this string: aaaaac.

根据正在使用的正则表达式引擎和正则表达式本身的细节,最坏情况下的性能有时会非常糟糕。举一个简单的例子,取这个正则表达式:a * a * a * b,与这个字符串匹配:aaaaac。

Confronted with this situation, a standard "NFA" type regex engine will do something like this:

面对这种情况,标准的“NFA”型正则表达式引擎将执行以下操作:

  1. Try matching the 1st a 5 times, the 2nd a zero times, and the 3rd a zero times.
  2. 尝试匹配第一个a 5次,第二个匹配零次,第三次匹配零次。

  3. Try matching the b against the c -- it fails.
  4. 尝试将b与c匹配 - 它失败了。

  5. "Backtrack" and match the 1st a 4 times, the 2nd 1 time, and the 3rd zero times.
  6. “回溯”并匹配第1次4次,第2次1次,第3次零次。

  7. Try matching the b again -- it fails.
  8. 尝试再次匹配b - 它失败了。

  9. Backtrack again, try the 1st a 4 times, the 2nd zero times, and the 3rd 1 time.
  10. 再次回溯,尝试第1次4次,第2次零次,第3次1次。

  11. ...
  12. Backtrack, try the 1st a 3 times, the 2nd 2 times, and the 3rd zero times...
  13. 回溯,尝试第1次3次,第2次2次,第3次零次......

(I think you can fill out the next few hundred steps by yourself.)

(我想你可以自己填写接下来的几百个步骤。)

If the regex was a*+a*a*b, that would never happen. It would be more like:

如果正则表达式是* + a * a * b,则永远不会发生。这更像是:

  1. Try matching the 1st a 5 times, the 2nd zero times, and the 3rd zero times.
  2. 尝试匹配第1次5次,第2次零次,第3次零次。

  3. Try matching the b -- it fails.
  4. 尝试匹配b - 它失败了。

  5. Since the 1st a is "possessive", once it has matched 5 times, it can never try matching a smaller number of times. And since there are no as left in the string for the other a*s to match, they can only match zero times. There is nothing left to try, so the match as a whole fails.
  6. 由于第一个a是“占有”,一旦它匹配了5次,它就永远不会尝试匹配较少的次数。并且由于字符串中没有剩余的a * s匹配,因此它们只能匹配零次。没有什么可以尝试的,所以整场比赛都失败了。

  7. There is no step 4. You are done. While your friends are waiting for their regexps to finish executing, you can kick back your heels and crack open your cold beverage of choice.
  8. 没有第4步。你已经完成了。当你的朋友们正在等待他们的regexps完成执行时,你可以放松脚跟并打开你选择的冷饮。

#3


1  

X*+ means X, zero or more times (Possessive)

X * +表示X,零次或多次(占有)

The possessive quantifiers always eat the entire input string, trying once (and only once) for a match. Unlike the greedy quantifiers, possessive quantifiers never back off, even if doing so would allow the overall match to succeed.

占有量词总是占用整个输入字符串,尝试一次(并且只有一次)匹配。与贪婪的量词不同,占有量词永远不会退缩,即使这样做也会使整体匹配成功。