Can someone explain the greediness in this regex?

Time: 2022-09-07 11:03:55

I have some text where each line has some good words and some bad (unwanted) words. The lines might look like this:

good1-good2 good3 bad1-good4-bad2 some more good words
good1-good2 good3 bad1 bad2 
good1-good2 good3 bad1 bad2 bad3

Now I need to reject everything in a line from the first bad word onward, including the bad word itself. So:

good1-good2 good3 bad1-good4-bad2 some more good words should become good1-good2 good3

good1-good2 good3 bad1 bad2 should become good1-good2 good3

good1-good2 good3 bad1 bad2 bad3 should become good1-good2 good3

I am using Python, so this is what I did:

import re
p=re.compile(r'([\w \d-]+) (bad1|bad2|bad3).+',re.I)
m=p.search('good1-good2 good3 bad1-good4-bad2 ')
m.group(1)

This gives good1-good2 good3, which is what I want, but

m=p.search('good1-good2 good3 bad1 bad2 ')
m.group(1)

returns good1-good2 good3 bad1. I thought that, because + is greedy, the + in ([\w \d-]+) goes on matching characters until the end of the line and then backtracks to find the last bad word, which in this case is bad2. But when I do this:

p=re.compile(r'([\w \d-]+) (bad1|bad2|bad3).+',re.I)
m=p.search('good1-good2 good3 bad1 bad2 bad3')
m.group(1)

it again returns good1-good2 good3 bad1. Can you please explain that? There might be a problem with my understanding of greediness in regex. I have figured out how to solve this problem by using a lazy regex like ([\w \d-]+?) (bad1|bad2|bad3).+ but I still do not understand why ([\w \d-]+) (bad1|bad2|bad3).+ always returns everything up to and including the first bad word (bad1 in this case).

Thanks for the time.

Edit: But suppose I have a line with only good words and no bad words, like good1-good2 good3. What should the regex be then? I tried ([\w \d-]+?) ?(bad1|bad2|bad3)?.* but this returns only the first letter of the line.

1 solution

#1


Regarding this case:

m=p.search('good1-good2 good3 bad1 bad2 ')

You are correct. ([\w \d-]+) is greedy, so it "eats" as much as possible and then backtracks until the rest of the pattern can match.
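
As a small sketch of that backtracking, the snippet below prints both capture groups for the asker's input (the variable names are only illustrative). The greedy group first consumes the whole line, then gives characters back until a space, a bad word, and at least one more character can follow it.

```python
import re

# The asker's original pattern: greedy group, then a bad word, then ".+".
p = re.compile(r'([\w \d-]+) (bad1|bad2|bad3).+', re.I)

# Note the trailing space after bad2: it is the one character ".+" needs.
m = p.search('good1-good2 good3 bad1 bad2 ')

kept = m.group(1)      # what the greedy group ends up holding
bad_word = m.group(2)  # the last bad word that still leaves a character for ".+"

print(repr(kept))      # 'good1-good2 good3 bad1'
print(repr(bad_word))  # 'bad2'
```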

Regarding this case, however:

m=p.search('good1-good2 good3 bad1 bad2 bad3')

What you're probably not seeing is that your .+ has to match at least one character after the bad word. That's why the regex can't match bad3 as the bad word: if it did, there would be no characters left for the .+ to match. Thus, it backtracks to bad2 once again. Change your .+ to .* to see the difference. It's only because you happened to have an extra trailing space in the first case, i.e. 'bad2 ', that things "worked out as expected" there.
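
To see the difference concretely, here is a minimal comparison of the two quantifiers on the line without a trailing space. The names `plus` and `star` are just labels for this sketch.

```python
import re

line = 'good1-good2 good3 bad1 bad2 bad3'  # no trailing space this time

# ".+" needs at least one character after the bad word, so bad3 cannot be chosen.
plus = re.search(r'([\w \d-]+) (bad1|bad2|bad3).+', line, re.I)

# ".*" may match nothing, so the greedy group can reach all the way to bad3.
star = re.search(r'([\w \d-]+) (bad1|bad2|bad3).*', line, re.I)

print(repr(plus.group(1)), repr(plus.group(2)))
# 'good1-good2 good3 bad1' 'bad2'
print(repr(star.group(1)), repr(star.group(2)))
# 'good1-good2 good3 bad1 bad2' 'bad3'
```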

In other words, some unfortunate coincidences left you confused; but your understanding of greediness is sound.

EDIT

For the edited part of the question, as @lovesh wrote in the comments below:

([\w \d-]+?) ?(bad1|bad2|bad3|$)
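
A quick check of this pattern against the sample lines from the question, as a sketch (the `lines` list and the `attempt` name are only for illustration). The lazy group grows one character at a time and stops as soon as either a bad word or the end of the line can follow it. The asker's earlier attempt is included for contrast: there everything after the group is optional, so the lazy group stops after a single character.

```python
import re

p = re.compile(r'([\w \d-]+?) ?(bad1|bad2|bad3|$)', re.I)

lines = [
    'good1-good2 good3 bad1-good4-bad2 some more good words',
    'good1-good2 good3 bad1 bad2 ',
    'good1-good2 good3 bad1 bad2 bad3',
    'good1-good2 good3',  # only good words: "$" ends the match
]
results = [p.search(line).group(1) for line in lines]
for r in results:
    print(repr(r))  # 'good1-good2 good3' for every line

# The attempt from the question's edit: nothing after the lazy group is required.
attempt = re.search(r'([\w \d-]+?) ?(bad1|bad2|bad3)?.*', 'good1-good2 good3', re.I)
print(repr(attempt.group(1)))  # 'g'
```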
