I have a file that includes a bunch of strings like "size=XXX;". I am trying python's re module for the first time and am a bit mystified by the following behavior: if I use a pipe for 'or' in a regular expression, I only see that bit of the match returned. E.g.:
>>> import re
>>> myfile = open('testfile.txt','r').read()
>>> print re.findall('size=50;',myfile)
['size=50;', 'size=50;', 'size=50;', 'size=50;']
>>> print re.findall('size=51;',myfile)
['size=51;', 'size=51;', 'size=51;']
>>> print re.findall('size=(50|51);',myfile)
['51', '51', '51', '50', '50', '50', '50']
>>> print re.findall(r'size=(50|51);',myfile)
['51', '51', '51', '50', '50', '50', '50']
The "size=" part of the match is gone. (Yet it is certainly used in the search, otherwise there would be more results). What am I doing wrong?
5 solutions
#1
20
The problem is that if the regex passed to re.findall contains capturing groups (i.e. portions of the regex enclosed in parentheses), then it is the groups that are returned, rather than the whole matched string.
One way to solve this issue is to use non-capturing groups (prefixed with ?:).
>>> import re
>>> s = 'size=50;size=51;'
>>> re.findall('size=(?:50|51);', s)
['size=50;', 'size=51;']
If the regex that re.findall tries to match does not capture anything, it returns the whole of the matched string.
Although using character classes might be the simplest option in this particular case, non-capturing groups provide a more general solution.
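To make the difference concrete, here is a minimal sketch comparing the three variants side by side. The sample string `s` is made up for illustration (the original `testfile.txt` is not available), and the variable names are hypothetical.

```python
import re

# Hypothetical sample data standing in for the contents of testfile.txt.
s = 'size=50;size=51;size=52;'

# Capturing group: findall returns only what the group captured.
captured = re.findall(r'size=(50|51);', s)

# Non-capturing group: findall returns the whole match.
whole = re.findall(r'size=(?:50|51);', s)

# Character class: works here only because the alternatives differ in one character.
char_class = re.findall(r'size=5[01];', s)

print(captured)    # ['50', '51']
print(whole)       # ['size=50;', 'size=51;']
print(char_class)  # ['size=50;', 'size=51;']
```

The character class would not help if the alternatives were, say, 50 and 128, which is why the non-capturing group is the more general fix.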
#2
6
When a regular expression contains parentheses, they capture their contents to groups, changing the behaviour of findall() to only return those groups. Here's the relevant section from the docs:
(...)
Matches whatever regular expression is inside the parentheses, and indicates the start and end of a group; the contents of a group can be retrieved after a match has been performed, and can be matched later in the string with the \number special sequence, described below. To match the literals '(' or ')', use \( or \), or enclose them inside a character class: [(] [)].
To avoid this behaviour, you can use a non-capturing group:
>>> print re.findall(r'size=(?:50|51);',myfile)
['size=51;', 'size=51;', 'size=51;', 'size=50;', 'size=50;', 'size=50;', 'size=50;']
Again, from the docs:
(?:...)
A non-capturing version of regular parentheses. Matches whatever regular expression is inside the parentheses, but the substring matched by the group cannot be retrieved after performing a match or referenced later in the pattern.
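If you want to keep the capturing group and still see the whole match, one alternative that this answer does not mention is re.finditer, which yields match objects instead of strings. A rough sketch, with a made-up sample string and hypothetical variable names:

```python
import re

# Hypothetical sample data; not the original testfile.txt.
s = 'size=50;size=51;'
pattern = re.compile(r'size=(50|51);')

# group(0) is always the whole match; group(1) is the first capturing group.
pairs = [(m.group(0), m.group(1)) for m in pattern.finditer(s)]

print(pairs)  # [('size=50;', '50'), ('size=51;', '51')]
```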
#3
2
'size=(50|51);' means you are looking for size=50 or size=51 but only capturing the 50 or 51 part (note the parentheses), therefore it does not return the size=.
If you want the size= returned, you can do:
re.findall('(size=50|size=51);',myfile)
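One thing to watch with this pattern: the trailing ';' sits outside the parentheses, so it is matched but not included in the results. A small sketch on a made-up string (the variable names are only for illustration):

```python
import re

# Hypothetical sample data; not the original testfile.txt.
s = 'size=50;size=51;'

# The ';' is outside the group, so it is required for a match but not captured.
without_semicolon = re.findall(r'(size=50|size=51);', s)

# Putting the ';' inside the capturing group keeps it in the result.
with_semicolon = re.findall(r'(size=(?:50|51);)', s)

print(without_semicolon)  # ['size=50', 'size=51']
print(with_semicolon)     # ['size=50;', 'size=51;']
```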
#4
1
I think what you want is [] instead of (). [] denotes a set of characters, while () denotes a group. Try something like this:
print re.findall('size=5[01];', myfile)
#5
0
In some cases a non-capturing group is not appropriate, for example with a regex that detects repeated words (example from the Python docs):
r'(\b\w+)\s+\1'
In this situation, to get the whole match one can use:
[groups[0] for groups in re.findall(r'((\b\w+)\s+\2)', text)]
Note that \1 has changed to \2, because the new outer parentheses become group 1.
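Here is a runnable sketch of the repeated-words case. The sample `text` is invented for illustration. It also shows re.finditer as an alternative (not part of the answer above) that leaves the original \1 numbering untouched:

```python
import re

# Hypothetical sample text containing two doubled words.
text = 'Paris in the the spring and and more'

# With two groups, findall returns tuples: (outer group, inner group).
tuples = re.findall(r'((\b\w+)\s+\2)', text)
whole = [groups[0] for groups in tuples]

# Alternative: finditer exposes the whole match as group(0),
# so the pattern can keep its original \1 backreference.
whole_iter = [m.group(0) for m in re.finditer(r'(\b\w+)\s+\1', text)]

print(tuples)      # [('the the', 'the'), ('and and', 'and')]
print(whole)       # ['the the', 'and and']
print(whole_iter)  # ['the the', 'and and']
```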