Extracting only part of the text matched by a regex in Python

Time: 2022-09-13 16:24:26

I have a previously matched pattern such as:

<a href="somelink here something">

Now I wish to extract only the value of one or more specific attributes in the tag, but the value may be anything and the attribute may occur anywhere in the tag.

regex_pattern=re.compile('href=\"(.*?)\"') 

Now I can use the above to match the attribute and the value part, but I need to extract only the (.*?) part (the value).

I can of course strip href=" and " later, but I'm sure I can use regex properly to extract only the required part.

In simple words, I want to match

abcdef=\"______________________\"

in the pattern, but want only the

____________________

part.

How do I do this?

2 solutions

#1


1  

Take a look at the .group() method on regular expression MatchObject results.

Your regular expression has an explicit capturing group (the part in () parentheses), and the .group() method gives you direct access to the string that was matched within that group. MatchObject instances are returned by several re functions and methods, including the .search() and .finditer() functions.

Demonstration:

>>> import re
>>> example = '<a href="somelink here something">'
>>> regex_pattern=re.compile('href=\"(.*?)\"') 
>>> regex_pattern.search(example)
<_sre.SRE_Match object at 0x1098a2b70>
>>> regex_pattern.search(example).group(1)
'somelink here something'
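One caveat to the demonstration above: .search() returns None when nothing matches, so chaining .group(1) directly raises AttributeError in that case. The following is a minimal sketch, not part of the original answer, that guards against this and also uses .finditer() to collect every value. The sample strings are hypothetical.

```python
import re

regex_pattern = re.compile(r'href="(.*?)"')

# Hypothetical sample input containing two tags (not from the question).
sample = '<a href="first link"> text <a class="x" href="second link">'

# finditer() yields one match object per occurrence;
# group(1) is the text captured by the parentheses.
values = [m.group(1) for m in regex_pattern.finditer(sample)]
print(values)

# search() returns None when there is no match, so check before calling group().
match = regex_pattern.search('<a name="anchor">')
value = match.group(1) if match else None
print(value)
```

group(0), or .group() with no argument, would instead return the whole match including href=" and the closing quote.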

From the Regular Expression syntax documentation on the (...) parenthesis syntax:

Matches whatever regular expression is inside the parentheses, and indicates the start and end of a group; the contents of a group can be retrieved after a match has been performed, and can be matched later in the string with the \number special sequence, described below. To match the literals '(' or ')', use \( or \), or enclose them inside a character class: [(] [)].
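The \number sequence mentioned in the quoted documentation can be illustrated with a small sketch. It assumes, beyond what the question states, that the attribute may be written with either single or double quotes. The pattern name and sample tags are made up for illustration.

```python
import re

# Group 1 captures the opening quote character; \1 requires the same
# character to close the value. Group 2 captures the attribute value.
quoted_href = re.compile(r'''href=(["'])(.*?)\1''')

double = quoted_href.search('<a href="it\'s here">')
single = quoted_href.search("<a href='say \"hi\"'>")

print(double.group(2))
print(single.group(2))
```

Because the quote character now occupies group 1, the value moves to group 2.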

#2


2  

Just use re.search('href=\"(.*?)\"', yourtext).group(1) on the matched string yourtext and it will yield the matched group.
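Since the question asks about arbitrary attributes (abcdef="..."), the one-liner can be generalized. The helper below is a hypothetical sketch, not a function from the re module. It assumes double-quoted values and returns None instead of raising when the attribute is absent.

```python
import re

def attribute_value(tag, name):
    """Return the double-quoted value of attribute `name` in `tag`, or None.

    Hypothetical helper for illustration; assumes values are double-quoted.
    """
    # re.escape() neutralizes any regex metacharacters in the attribute name.
    # The lookbehind stops "href" from matching inside e.g. "data-href".
    pattern = r'(?<![\w-])' + re.escape(name) + r'="(.*?)"'
    match = re.search(pattern, tag)
    return match.group(1) if match else None

print(attribute_value('<a href="somelink here something">', 'href'))
print(attribute_value('<a class="c" title="t">', 'title'))
print(attribute_value('<a data-href="x">', 'href'))
```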
