Extracting specific text with multiple regexes in Python?

Date: 2022-09-13 16:19:32

I have a problem using regular expressions in Python 3, and I would be grateful if someone could help. I have a text file like the one below:

Header A
text text
text text
Header B
text text
text text
Header C
text text
here is the end

What I would like is a list of the text between the headers, including the headers themselves. I am using this regular expression:

 re.findall(r'(?=(Header.*?Header|Header.*?end))',data, re.DOTALL)

The result is:

['Header A\ntext text\n text text\n Header', 'Header B\ntext text\n text text\n Header', 'Header C\n text text here is the end']

The problem is that each item in the list ends with the next header. As you can see, each section ends where the next header begins, but the last section does not end in any specific way.

Is there a way to get a list (not a tuple) of every header together with its own text, as substrings, using regular expressions?

3 Answers

#1


Header [^\n]*[\s\S]*?(?=Header|$)

Try this. See the demo.

https://regex101.com/r/iS6jF6/21

import re
p = re.compile(r'Header [^\n]*[\s\S]*?(?=Header|$)')
test_str = "Header A\ntext text\ntext text\nHeader B\ntext text\ntext text\nHeader C\ntext text\nhere is the end"

re.findall(p, test_str)
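
One caveat that may be worth checking: the lookahead `(?=Header|$)` stops at any occurrence of `Header`, including one in the middle of a body line. The sketch below is not part of the answer. It assumes that real headers always start at the beginning of a line (the question does not say so), anchors the pattern with `^` under `re.MULTILINE`, and uses `\Z` for the end of the string. The sample text with `see Header B below` is made up for illustration.

```python
import re

# Made-up sample: the third line mentions "Header" in the middle of body text.
text = ("Header A\ntext text\nsee Header B below\n"
        "Header B\ntext text\n"
        "Header C\ntext text\nhere is the end")

# Pattern from the answer: stops at every "Header", even mid-line.
original = re.compile(r'Header [^\n]*[\s\S]*?(?=Header|$)')

# Assumed variant: with re.MULTILINE, ^ matches at the start of each line,
# and \Z matches only at the very end of the string.
anchored = re.compile(r'^Header [^\n]*[\s\S]*?(?=^Header |\Z)', re.MULTILINE)

original_sections = original.findall(text)
sections = anchored.findall(text)

# Optional: drop the newline left before the next header.
stripped = [s.rstrip("\n") for s in sections]
print(len(original_sections), len(sections))
print(stripped)
```

On this sample the original pattern splits the first section in two, while the anchored variant keeps it whole.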

#2


How about:

re.findall(r'(?=(Header.*?)(?=Header|end))',data, re.DOTALL)
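
Note that because `end` sits inside the lookahead, the last captured item stops just before the word `end` on the question's sample text. The sketch below shows this, followed by a possible variant (an assumption, not from the answer) that stops at the next `\nHeader` or at the end of the string via `\Z`. The variable names are illustrative.

```python
import re

data = ("Header A\ntext text\ntext text\n"
        "Header B\ntext text\ntext text\n"
        "Header C\ntext text\nhere is the end")

# Pattern as posted: the captured group ends before "Header" or "end".
as_posted = re.findall(r'(?=(Header.*?)(?=Header|end))', data, re.DOTALL)
print(repr(as_posted[-1]))  # the last item lacks the word "end"

# Assumed variant: consume up to the newline before the next header,
# or up to the end of the string (\Z).
variant = re.findall(r'Header.*?(?=\nHeader|\Z)', data, re.DOTALL)
print(variant)
```

The variant also drops the outer lookahead, which is not needed here since the sections do not overlap.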

#3


You actually need to use a positive lookahead assertion.

>>> import re
>>> s = '''Header A
text text
text text
Header B
text text
text text
Header C
text text
here is the end'''
>>> re.findall(r'Header.*?(?=Header)|Header.*?end',s, re.DOTALL)
['Header A\ntext text\ntext text\n', 'Header B\ntext text\ntext text\n', 'Header C\ntext text\nhere is the end']

Include \n inside the positive lookahead so that each item does not end with a trailing \n character.

>>> re.findall(r'Header.*?(?=\nHeader)|Header.*?end',s, re.DOTALL)
['Header A\ntext text\ntext text', 'Header B\ntext text\ntext text', 'Header C\ntext text\nhere is the end']

OR

Split your input on the newline that comes just before the string Header.

>>> re.split(r'\n(?=Header\b)', s)
['Header A\ntext text\ntext text', 'Header B\ntext text\ntext text', 'Header C\ntext text\nhere is the end']
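
The split approach can be wrapped in a small helper for reuse. This is only a sketch: the function name `split_sections` and its `header` parameter are hypothetical, and it assumes every section starts with the header word at the beginning of a line.

```python
import re

def split_sections(text, header="Header"):
    """Return a list of sections, each starting with the header word."""
    # Split on the newline directly before the header word; the lookahead
    # keeps the header itself at the start of the following section.
    parts = re.split(r'\n(?=' + re.escape(header) + r'\b)', text)
    # Discard anything that does not start with the header,
    # e.g. text that precedes the first header.
    return [part for part in parts if part.startswith(header)]

sample = ("Header A\ntext text\ntext text\n"
          "Header B\ntext text\ntext text\n"
          "Header C\ntext text\nhere is the end")

sections = split_sections(sample)
print(sections)
```

For the text file from the question, the contents could be read with `open(...).read()` and passed to the helper in the same way.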
