Fastest way in Python to check if a string matches any term in a list of words, phrases, and boolean ANDs?

Asked: 2023-01-05 01:34:21

I am trying to find a fast way in Python to check if a list of terms can be matched against strings ranging in size from 50 to 50,000 characters.

A term can be:

  • A word, e.g. 'apple'

  • A phrase, e.g. 'cherry pie'

  • Boolean ANDing of words and phrases, e.g. 'sweet pie AND savoury pie AND meringue'

A match is when a word or phrase occurs in the string delimited by word boundaries, so:

match(term='apple', string='An apple a day.') # True
match(term='berry pie', string='A delicious berry pie.') # True
match(term='berry pie', string='A delicious blueberry pie.') # False
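
The desired behaviour can be sketched with a minimal helper (`match` here is a hypothetical illustration, not the final solution) using `re.escape` and `\b` word boundaries:

```python
import re

def match(term, string):
    # Hypothetical helper illustrating the desired semantics: the term must
    # appear delimited by word boundaries, matched case-insensitively.
    pattern = r'\b%s\b' % re.escape(term)
    return re.search(pattern, string, re.IGNORECASE) is not None

print(match('apple', 'An apple a day.'))                 # True
print(match('berry pie', 'A delicious berry pie.'))      # True
print(match('berry pie', 'A delicious blueberry pie.'))  # False
```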

I currently have about 40 terms, most of them are simple words. The number of terms will increase over time, but I wouldn't expect it to get beyond 400.

I'm not interested in which term(s) a string matches, or where in the string it matches, I just need a true/false value for a match against each string - it is much more likely that no terms will match the string, so for the 1 in 500 where it does match, I can store the string for further processing.

Speed is the most important criterion, and I'd like to leverage the existing code of those smarter than me rather than trying to implement a white paper. :)

So far the speediest solution I've come up with is:

import re

def data():
    return [
        "The apple is the pomaceous fruit of the apple tree, species Malus domestica in the rose family (Rosaceae).",
        "This resulted in early armies adopting the style of hunter-foraging.",
        "Beef pie fillings are popular in Australia. Chicken pie fillings are too."
    ]

def boolean_and(terms):
    # Each term becomes a lookahead, so all terms must match, in any order.
    return '(%s)' % ''.join(r'(?=.*\b%s\b)' % term for term in terms)

def run():
    words_and_phrases = ['apple', 'cherry pie']
    booleans = [boolean_and(terms) for terms in [['sweet pie', 'savoury pie', 'meringue'], ['chicken pie', 'beef pie']]]
    regex = re.compile(r'(?i)(\b(%s)\b|%s)' % ('|'.join(words_and_phrases), '|'.join(booleans)))
    matched_data = list()
    for d in data():
        if regex.search(d):
            matched_data.append(d)

The regex winds up as:

(?i)(\b(apple|cherry pie)\b|((?=.*\bsweet pie\b)(?=.*\bsavoury pie\b)(?=.*\bmeringue\b))|((?=.*\bchicken pie\b)(?=.*\bbeef pie\b)))

So all the terms are ORed together, case is ignored, the words/phrases are wrapped in \b for word boundaries, the boolean ANDs use lookaheads so that all the terms are matched, but they do not have to match in a particular order.

Timeit results:

 print timeit.Timer('run()', 'from __main__ import run').timeit(number=10000)
 1.41534304619

Without the lookaheads (ie. the boolean ANDs) this is really quick, but once they're added the speed slows down considerably.

Does anybody have ideas on how this could be improved? Is there a way to optimise the lookahead, or maybe an entirely different approach? I don't think stemming will work, as it tends to be a bit greedy with what it matches.

2 Answers

#1


4  

The boolean AND regex with the multiple lookahead assertions can be sped up considerably by anchoring them to the beginning of the string. Or better yet, use two regexes: one for the ORed list of terms using the re.search method, and a second regex with the boolean ANDed list using the re.match method like so:

def boolean_and_new(terms):
    return ''.join([r'(?=.*?\b%s\b)' % (term) for term in terms])

def run_new():
    words_and_phrases = ['apple', 'cherry pie']
    booleans = [boolean_and_new(terms) for terms in [
        ['sweet pie', 'savoury pie', 'meringue'],
        ['chicken pie', 'beef pie']]]
    regex1 = re.compile(r'(?i)\b(?:%s)\b' % ('|'.join(words_and_phrases)))
    regex2 = re.compile(r'(?i)%s' % ('|'.join(booleans)))
    matched_data = list()
    for d in data():
        if regex1.search(d) or regex2.match(d):
            matched_data.append(d)

The effective regexes for this data set are:

regex1 = r'(?i)\b(?:apple|cherry pie)\b'
regex2 = r'(?i)(?=.*?\bsweet pie\b)(?=.*?\bsavoury pie\b)(?=.*?\bmeringue\b)|(?=.*?\bchicken pie\b)(?=.*?\bbeef pie\b)'

Note that the second regex effectively has a ^ anchor at the start since it's being used with the re.match method. This also includes a few extra (minor) tweaks: removing unnecessary capture groups and changing the greedy dot-star to lazy. This solution runs nearly 10 times faster than the original on my Win32 box running Python 3.0.1.

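The effect of the anchoring can be seen with a rough micro-benchmark sketch (the text and repetition counts here are arbitrary; on a non-matching string, `search` must retry the lookaheads at every position, while `match` tries them only once):

```python
import re
import timeit

# Non-matching text: the worst case, where the engine must exhaust all positions.
text = "Beef stew and chicken soup are popular winter dishes. " * 20

pattern = re.compile(r'(?i)(?=.*?\bchicken pie\b)(?=.*?\bbeef pie\b)')

# search() retries the lookaheads at every position in the string;
# match() tries them only once, anchored at position 0.
t_search = timeit.timeit(lambda: pattern.search(text), number=100)
t_match = timeit.timeit(lambda: pattern.match(text), number=100)
print(t_search > t_match)  # search is far slower on non-matching input
```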
Additional: So why is it faster? Let's look at a simple example which describes how the NFA regex "engine" works. (Note that the following description derives from the classic work on the subject: Mastering Regular Expressions (3rd Edition) by Jeffrey Friedl (MRE3), which explains all this efficiency stuff in great detail - highly recommended.) Let's say you have a target string containing one word and you want a regex to see if that word is: "apple". Here are two regexes one might compose to do the job:

re1 = r'^apple'
re2 = r'apple'
s = r'orange'

If your target string is: apple (or applesauce or apple-pie etc.), then both regexes will successfully match very quickly. But if your target string is say: orange, the situation is different. An NFA regex engine must try all possible permutations of the regex on the target string before it can declare match failure. The way that the regex "engine" works is that it keeps an internal pointer to its current location within the target string, and a second pointer to a location within the regex pattern, and advances these pointers as it goes about its business. Note that these pointers point to locations between the characters, and to begin with, the target string pointer is set to the location before the first letter of the target string.

re1: The first token in the regex is the ^ start of string anchor. This "anchor" is one of the special "assertion" expressions which matches a location in a target string and does not actually match any characters. (Lookahead and lookbehind and the \b word boundary expressions are also assertions which match a location and do not "consume" any characters.) Ok, with the target string pointer initialized to the location before the first letter of the word orange, the regex engine checks if the ^ anchor matches, and it does (because this location is, in fact, the beginning of the string). So the pattern pointer is advanced to the next token in the regex, the letter a. (The target string pointer is not advanced). It then checks to see if the regex literal a matches the target string character o. It does not match. At this point, the regex engine is smart enough to know that the regex can never succeed at any other locations within the target string (because the ^ can never match anywhere but at the start). Thus it can declare match failure.

re2: In this case the engine begins by checking if the first pattern char a matches the first target char 'o'. Once again, it does not. However, in this case the regex engine is not done! It has determined that the pattern will not match at the first location, but it must try (and fail) at ALL locations within the target string before it can declare match failure. Thus, the engine advances the target string pointer to the next location (Friedl refers to this as the transmission "bump-along"). For each "bump along", it resets the pattern pointer back to the beginning. Thus it checks the first token in the pattern a against the second char in the string: r. This also does not match, so the transmission bumps along again to the next location within the string. At this point it tests the first char of the pattern a against the third char of the target: a, which does match. The engine advances both pointers and checks the second char in the regex p against the fourth character in the target n. This fails. At this point the engine declares failure at the location before the a in orange and then bumps along again to the n. It goes on like this until it fails at every location in the target string, at which point it can declare overall match failure.

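The bump-along behaviour can be simulated with a toy scanner (an illustration of the idea only, not how the `re` module is actually implemented):

```python
def naive_search(pattern, text, anchored=False):
    """Toy literal-string matcher that counts match attempts, illustrating
    the 'transmission bump-along' described above."""
    attempts = 0
    positions = [0] if anchored else range(len(text) + 1)
    for start in positions:
        attempts += 1
        if text.startswith(pattern, start):
            return True, attempts
    return False, attempts

print(naive_search('apple', 'orange'))                 # (False, 7)
print(naive_search('apple', 'orange', anchored=True))  # (False, 1)
```

The unanchored scan must fail at all seven inter-character positions of 'orange' before giving up, whereas the anchored version fails after a single attempt.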
For long subject strings, this extra unnecessary work can take a lot of time. Crafting an accurate and efficient regex is equal parts art and science. And to craft a really great regex, one must understand exactly how the engine actually works under the hood. Gaining this knowledge requires time and effort, but the time spent will (in my experience) pay for itself many times over. And there really is only one good place to effectively learn these skills, and that is to sit down and study Mastering Regular Expressions (3rd Edition) and then practice the techniques learned. I can honestly say that this is, hands down, the most useful book I have ever read. (It's even funny!)

Hope this helps! 8^)

#2


4  

I am going to give a partial answer here, but why don't you split the test and matching strings on word boundaries and build a set? You can intersect sets very quickly, and if the sets intersect, then you can do the expensive regex test.

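A minimal sketch of that idea (the helper names are hypothetical; note the prefilter can report True for strings the full regex would later reject, e.g. 'blueberry pie' shares the word 'pie'):

```python
import re

def tokenize(s):
    # Split on word characters and lowercase for case-insensitive comparison.
    return set(w.lower() for w in re.findall(r'\w+', s))

# Every word appearing in any term: a string can only possibly match
# if it contains at least one of these words.
terms = ['apple', 'cherry pie', 'sweet pie', 'savoury pie', 'meringue']
term_words = set()
for term in terms:
    term_words |= tokenize(term)

def might_match(string):
    # Cheap set-intersection prefilter; run the full regex only if True.
    return bool(term_words & tokenize(string))

print(might_match('An apple a day.'))       # True
print(might_match('Nothing to see here.'))  # False
```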
