Regular Expressions in a Python Sentence Extractor

Date: 2021-08-13 12:51:38

I have a script that gives me the sentences that contain one of a specified list of key words. A sentence is defined as anything between two periods.

Now I want to use it to select the whole of a sentence like 'Put 1.5 grams of powder in', so that if 'powder' were a key word it would get the entire sentence and not just '5 grams of powder'.

I am trying to figure out how to express that a sentence lies between two occurrences of a period followed by a space. My new filter is:

from itertools import ifilter, imap  # Python 2; on Python 3 use filter/map
from re import finditer

def iterphrases(text):
    return ifilter(None, imap(lambda m: m.group(1), finditer(r'([^\.\s]+)', text)))
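
Running the filter above on the example sentence shows why only fragments come out: the character class excludes both periods and whitespace, so each match is just a word-like token (Python 3 `re` shown for illustration):

```python
import re

text = "Put 1.5 grams of powder in."
# r'([^\.\s]+)' matches maximal runs of characters that are neither
# a period nor whitespace -- word-like fragments, not sentences.
print([m.group(1) for m in re.finditer(r'([^\.\s]+)', text)])
# ['Put', '1', '5', 'grams', 'of', 'powder', 'in']
```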

However, now I no longer print any sentences, just pieces/phrases of words (including my key word). I am very confused as to what I am doing wrong.

3 Answers

#1


if you don't HAVE to use an iterator, re.split would be a bit simpler for your use case (custom definition of a sentence):

re.split(r'\.\s', text)

Note that the last sentence will include the trailing '.', or will be empty (if the text ends with whitespace after the last period); to fix that:

re.split(r'\.\s', re.sub(r'\.\s*$', '', text))
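
For example (with a test string of my own), splitting on period-plus-whitespace leaves the decimal in '1.5' intact, because that period is followed by a digit rather than a space:

```python
import re

text = "Put 1.5 grams of powder in. Stir well. Done."
# Strip the trailing period first, then split on "period + whitespace".
sentences = re.split(r'\.\s', re.sub(r'\.\s*$', '', text))
print(sentences)
# ['Put 1.5 grams of powder in', 'Stir well', 'Done']
```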

also have a look at a bit more general case in the answer for Python - RegEx for splitting text into sentences (sentence-tokenizing)

and for a completely general solution you would need a proper sentence tokenizer, such as nltk.tokenize

nltk.tokenize.sent_tokenize(text)

#2


Here you get it as an iterator. Works with my test cases. It considers a sentence to be anything (non-greedy) up to a period that is followed by either a space or the end of the line.

import re
sentence = re.compile(r"\w.*?\.(?= |$)", re.MULTILINE)  # raw string, so \w and \. are not invalid str escapes
def iterphrases(text):
    return (match.group(0) for match in sentence.finditer(text))
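
For instance (my own test string), the lookahead keeps the period inside '1.5' from ending a sentence, because it is followed by a digit rather than a space or end of line:

```python
import re

sentence = re.compile(r"\w.*?\.(?= |$)", re.MULTILINE)

def iterphrases(text):
    # Non-greedy match from a word character up to a period that is
    # followed by a space or end of line.
    return (match.group(0) for match in sentence.finditer(text))

print(list(iterphrases("Put 1.5 grams of powder in. Stir well.")))
# ['Put 1.5 grams of powder in.', 'Stir well.']
```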

#3


If you are sure that '.' is used for nothing besides sentence delimiters and that every relevant sentence ends with a period, then the following may be useful:

matches = re.finditer(r'([^.]*?(powder|keyword2|keyword3).*?)\.', text)
result = [m.group() for m in matches]
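
A quick check with a keyword-bearing text of my own (note the leading space in the match, and that the caveat above matters: a decimal like '1.5' would again truncate the prefix, since `[^.]*?` cannot cross its period):

```python
import re

text = "Heat the oven. Put the powder in. Serve cold."
# Shortest non-period-containing prefix + keyword + shortest tail up to a period.
matches = re.finditer(r'([^.]*?(powder|keyword2|keyword3).*?)\.', text)
result = [m.group() for m in matches]
print(result)
# [' Put the powder in.'] -- the leading space carries over from after '. '
```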
