Double quotes inside single quotes in a re expression (Python) [duplicate]

Time: 2022-09-15 16:18:01

I am new to Python. I was going through a repository on GitHub, and I saw the following line of code, which extracts all URLs from a webpage. I understand regular expressions and capture groups, but I don't understand why there are extra double quotation marks enclosed within the single quotation marks.

links = re.findall('"((http|ftp)s?://.*?)"', html)

That is, how is it different from the following code?

links = re.findall('((http|ftp)s?://.*?)', html)

I tried experimenting and saw that only the first one matches the URLs correctly, while the second one doesn't, but I don't understand why.

Any help is appreciated.

Thank you.

1 answer

#1


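The double quotes are part of the regex. They ensure that the pattern only matches a URL that is actually surrounded by double quotes; so foo bar http://whatever.com wouldn't match, but <a href="http://whatever.com"> will. The closing quote also gives the lazy .*? something to stop at: without it, .*? matches as little as possible (nothing at all), so the second pattern only ever captures the http:// prefix.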
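
To see the difference concretely, here is a small sketch that runs both patterns on the same input. The sample html string is made up for illustration and is not from the original repository. Because each pattern has two capture groups, re.findall returns a list of tuples, one element per group.

```python
import re

# Hypothetical sample input: one bare URL in text, one URL in a double-quoted attribute.
html = '<p>foo bar http://whatever.com</p> <a href="http://example.com/page">link</a>'

# Pattern from the question: the URL must sit between two literal double quotes.
quoted = re.findall('"((http|ftp)s?://.*?)"', html)

# Same pattern without the quotes: nothing follows the lazy .*?, so it matches
# the empty string and each hit stops right after "://".
unquoted = re.findall('((http|ftp)s?://.*?)', html)

print(quoted)    # [('http://example.com/page', 'http')]
print(unquoted)  # [('http://', 'http'), ('http://', 'http')]
```

The first element of each tuple is the outer group (the URL), and the second is the inner (http|ftp) group.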

Note this is a really fragile way of doing things, though, since single quotes are also valid in HTML but wouldn't match the regex.
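
As a rough illustration of that fragility, the sketch below compares the regex with the standard library's html.parser on input that mixes quote styles. The LinkCollector class, the URL_PREFIXES tuple, and the sample html are assumptions made up for this example, not part of the original code; a real project might prefer a dedicated HTML library.

```python
import re
from html.parser import HTMLParser

# Hypothetical sample input: one double-quoted and one single-quoted attribute.
html = "<a href=\"http://whatever.com\">a</a> <a href='https://example.org/x'>b</a>"

# The regex from the question only sees the double-quoted URL.
regex_links = [m[0] for m in re.findall('"((http|ftp)s?://.*?)"', html)]

URL_PREFIXES = ("http://", "https://", "ftp://", "ftps://")


class LinkCollector(HTMLParser):
    """Collects href/src attribute values that look like absolute URLs."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs with the quotes already removed.
        for name, value in attrs:
            if name in ("href", "src") and value and value.startswith(URL_PREFIXES):
                self.links.append(value)


collector = LinkCollector()
collector.feed(html)

print(regex_links)      # ['http://whatever.com']
print(collector.links)  # ['http://whatever.com', 'https://example.org/x']
```

The parser strips the quotes itself, so it does not matter which quote style the page uses.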
