Extract part of a regex match

Time: 2022-09-13 11:41:18

I want a regular expression to extract the title from an HTML page. Currently I have this:

title = re.search('<title>.*</title>', html, re.IGNORECASE).group()
if title:
    title = title.replace('<title>', '').replace('</title>', '') 

Is there a regular expression that extracts just the contents of <title>, so I don't have to remove the tags?

thanks!

9 answers

#1


79  

Use ( ) in the regexp and group(1) in Python to retrieve the captured string (re.search returns None if it finds no match, so don't call group() on the result directly):

title_search = re.search('<title>(.*)</title>', html, re.IGNORECASE)

if title_search:
    title = title_search.group(1)
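
One caveat with the pattern above: `.*` is greedy and, by default, does not match newlines, so a title that spans several lines is missed and a page with more than one `</title>` is over-matched. A minimal sketch of a more forgiving variant follows; the helper name `extract_title` and the choice to return None are assumptions for illustration, not part of the original answer.

```python
import re

# Hypothetical helper for illustration; the name and the None default are assumptions.
# (.*?) is non-greedy, and re.DOTALL lets "." match newlines inside the title.
TITLE_RE = re.compile(r'<title>(.*?)</title>', re.IGNORECASE | re.DOTALL)

def extract_title(html):
    """Return the text between <title> and </title>, or None if there is no match."""
    match = TITLE_RE.search(html)
    if match is None:
        return None
    return match.group(1).strip()
```

Calling `extract_title(html)` then needs only a single `is None` check at the call site.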

#2


31  

Please, do NOT use regex to parse markup languages. Use lxml or beautifulsoup.

#3


4  

Try using capturing groups:

title = re.search('<title>(.*)</title>', html, re.IGNORECASE).group(1)

#4


3  

The provided pieces of code do not cope with exceptions (calling group() on None raises one when there is no match). May I suggest:

getattr(re.search(r"<title>(.*)</title>", s, re.IGNORECASE), 'groups', lambda:[u""])()[0]

This returns the first captured group, or an empty string by default if the pattern has not been found.
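
The getattr one-liner is compact but not easy to read. A sketch of the same fallback behaviour written out step by step, assuming a hypothetical function name `title_or_empty`:

```python
import re

def title_or_empty(s):
    """Return the captured title text, or "" when the pattern is not found."""
    # Hypothetical name; same pattern and flags as the one-liner above.
    match = re.search(r"<title>(.*)</title>", s, re.IGNORECASE)
    return match.group(1) if match else ""
```

Both forms give the same result; this one simply makes the None check explicit.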

#5


2  

Try:

title = re.search('<title>(.*)</title>', html, re.IGNORECASE).group(1)

#6


2  

re.search('<title>(.*)</title>', s, re.IGNORECASE).group(1)

#7


2  

Using regular expressions to parse HTML is generally not a good idea. You can use an HTML parser such as Beautiful Soup for that. Check out http://www.crummy.com/software/BeautifulSoup/documentation.html

Also remember that some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems.

#8


2  

May I recommend Beautiful Soup? It is a very good library for parsing your whole HTML document.

from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, "html.parser")
titleName = soup.title.string  # .name would only give the tag name, "title"

#9


-1  

I'd think this should suffice:

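import re
# re.MULTILINE only changes ^ and $, which this pattern does not use, so it is omitted
pattern = re.compile(r'<title>([^<]*)</title>', re.IGNORECASE)
match = pattern.search(text)
title = match.group(1) if match else None  # None when no <title> is found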

... assuming that your text (HTML) is in a variable named "text."

This also assumes that no other HTML tags can legally be embedded inside an HTML TITLE element, and that there is no way to legally embed any other < character within such a container/block.

However ...

Don't use regular expressions for HTML parsing in Python. Use an HTML parser! (Unless you're going to write a full parser, which would be a lot of extra work when various HTML, SGML and XML parsers are already in the standard libraries.)
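
Since parsers that ship with Python are mentioned here, a minimal sketch using `html.parser.HTMLParser` from the standard library may help. The names `TitleParser` and `get_title` are hypothetical, chosen only for this illustration, and the sketch keeps just the first title it sees.

```python
from html.parser import HTMLParser

class TitleParser(HTMLParser):
    """Collects the text inside the first <title> element."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self._done = False
        self._parts = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names, so <TITLE> also arrives as "title".
        if tag == "title" and not self._done:
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title" and self._in_title:
            self._in_title = False
            self._done = True

    def handle_data(self, data):
        # Text may arrive in several chunks, so collect and join later.
        if self._in_title:
            self._parts.append(data)

    @property
    def title(self):
        return "".join(self._parts).strip() if self._done else None

def get_title(html):
    """Return the first title's text, or None if no complete <title> was seen."""
    parser = TitleParser()
    parser.feed(html)
    parser.close()
    return parser.title
```

This is more code than a one-line regex, but it does not depend on the exact spelling or case of the tags in the page.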

If you're handling "real world" tag soup HTML (which frequently fails any SGML/XML validator), then use the BeautifulSoup package. It isn't in the standard libraries (yet) but is widely recommended for this purpose.

Another option is lxml, which is written for properly structured (standards-conformant) HTML. But it has an option to fall back to using BeautifulSoup as a parser: ElementSoup.
