Regular expressions: how do I find the substring between two regex matches?

Let's say I have a string like:

data = 'MESSAGE: Hello world!END OF MESSAGE'

I want to get the string between 'MESSAGE: ' and the next fully capitalized word. The message itself never contains any fully capitalized words.

I tried to get this by using this regular expression in re.search:

re.search('MESSAGE: (.*)([A-Z]{2,})', data).group(1)

Here I would like it to output 'Hello world!', but it always returns the wrong result. With regular expressions it is easy to find a substring that occurs between two fixed strings, but how do you find a substring between two strings that are themselves matches for a regular expression? I have tried making the pattern a raw string, but that didn't seem to help.

I hope I am expressing myself clearly. I have extensive experience in Python but am new to regular expressions. If possible, I would like an explanation along with an example of how to make my specific code work. Any helpful posts are greatly appreciated.

BTW, I am using Python 3.3.

3 Answers

#1

Your code doesn't work, but for the opposite reason:

re.search('MESSAGE: (.*)([A-Z]{2,})', data).group(1)

would match

'Hello world!END OF MESSA'

because (.*) is "greedy", i.e. it matches as much as possible while still allowing the rest of the pattern (at least two uppercase characters) to match. You need to use a non-greedy quantifier instead:

re.search('MESSAGE: (.*?)([A-Z]{2,})', data).group(1)

which correctly matches

'Hello world!'
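
To see the difference side by side, here is a minimal sketch that runs both patterns on the string from the question. The variable names greedy and lazy are only illustrative, and the patterns are written as raw strings as a matter of habit rather than necessity.

```python
import re

data = 'MESSAGE: Hello world!END OF MESSAGE'

# Greedy: .* grabs as much as it can, then backs off only far enough
# for [A-Z]{2,} to match the last two capitals in the string.
greedy = re.search(r'MESSAGE: (.*)([A-Z]{2,})', data).group(1)

# Lazy: .*? grows one character at a time and stops at the first
# position where [A-Z]{2,} can match.
lazy = re.search(r'MESSAGE: (.*?)([A-Z]{2,})', data).group(1)

print(repr(greedy))  # 'Hello world!END OF MESSA'
print(repr(lazy))    # 'Hello world!'
```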

#2

One little question mark:

re.search('MESSAGE: (.*?)([A-Z]{2,})', data).group(1)
Out[91]: 'Hello world!'

If you make the first capturing group lazy, it won't consume anything after the exclamation point.
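
One way to check where the lazy group stops is to inspect the match object itself. The following is a small sketch using the string from the question; the name match is just an illustrative variable.

```python
import re

data = 'MESSAGE: Hello world!END OF MESSAGE'
match = re.search(r'MESSAGE: (.*?)([A-Z]{2,})', data)

# Group 1 ends at the exclamation point; group 2 picks up the first
# run of two or more capitals that follows it.
print(match.group(1))  # Hello world!
print(match.group(2))  # END
print(match.span(1))   # (9, 21)
```

Note that [A-Z]{2,} is itself still greedy, which is why the second group holds all of 'END' rather than just 'EN'.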

#3

You need your .* to be non-greedy (see the first ?), which means that it stops matching at the first point where the next item can match, and the second group should be non-capturing (see the ?:), since its contents aren't needed.

import re
data = 'MESSAGE: Hello world!END OF MESSAGE'
regex = r'MESSAGE: (.*?)(?:[A-Z]{2,})'
re.search(regex, data).group(1)

Returns:

'Hello world!'

Alternatively, you could use this:

regex = r'MESSAGE: (.*?)[A-Z]{2,}'

To break this down (I'll include the search call with the VERBOSE flag):

regex = r'''
         MESSAGE:\s    # first part, \s for the space (matches whitespace)
         (.*?)         # non-greedy, anything but a newline
         (?:[A-Z]{2,}) # a secondary group, but non-capturing,
                       #  good for alternatives separated by a pipe, |
         '''
re.search(regex, data, re.VERBOSE).group(1)
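
As a rough illustration of what the non-capturing group changes, this sketch compares the groups reported by the two variants. The names capturing and non_capturing are made up for the example; group(1) is the same in both cases, so the choice only matters if you look at the other groups.

```python
import re

data = 'MESSAGE: Hello world!END OF MESSAGE'

# Same pattern twice; only the second group differs.
capturing = re.search(r'MESSAGE: (.*?)([A-Z]{2,})', data)
non_capturing = re.search(r'MESSAGE: (.*?)(?:[A-Z]{2,})', data)

# With (?:...) the capitalized word still has to match, but it is
# not stored as a group.
print(capturing.groups())      # ('Hello world!', 'END')
print(non_capturing.groups())  # ('Hello world!',)
```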
