Regular expression to get the src value from an img tag

I am using the following regex to get the src value of the first img tag in an HTML document.

string match = "src=(?:\"|\')?(?<imgSrc>[^>]*[^/].(?:jpg|png))(?:\"|\')?"

Right now it captures the entire src attribute, which I don't need. I only need the URL inside the src attribute. How can I do that?

3 Solutions

#1

Parse your HTML with something else. HTML is not regular and thus regular expressions aren't at all suited to parsing it.

Use an HTML parser, or an XML parser if the HTML is strict. It's a lot easier to get the src attribute's value using XPath:

//img/@src

XML parsing is built into the System.Xml namespace. It's incredibly powerful. HTML parsing is a bit more difficult if the HTML isn't strict, but there are lots of libraries around that will do it for you.
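
As a rough illustration of the XPath approach, here is a minimal sketch using XmlDocument from System.Xml. The class name ImgSrcXPath, the method FirstImgSrc, and the sample markup are made up for this example. It assumes the input is well-formed XML with no DOCTYPE and no default namespace (a document declaring the XHTML namespace would need an XmlNamespaceManager and a prefixed query), so treat it as a starting point rather than a general HTML solution.

```csharp
using System;
using System.Xml;

public static class ImgSrcXPath
{
    // Returns the src value of the first <img> element, or null if there is none.
    // Throws XmlException when the input is not well-formed XML.
    public static string FirstImgSrc(string xhtml)
    {
        var doc = new XmlDocument();
        doc.LoadXml(xhtml);

        // SelectSingleNode returns the first node matching the XPath query.
        XmlNode srcAttribute = doc.SelectSingleNode("//img/@src");
        return srcAttribute?.Value;
    }

    public static void Main()
    {
        // Hypothetical sample input; real pages are often not well-formed.
        string xhtml =
            "<html><body><p><img src=\"images/a.png\" alt=\"a\" /></p>" +
            "<img src=\"b.jpg\" /></body></html>";

        Console.WriteLine(FirstImgSrc(xhtml)); // prints images/a.png
    }
}
```

If the markup is not strict, LoadXml will throw, which is the point where a dedicated HTML parsing library becomes the more practical choice.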

#2

See "When not to use Regex in C# (or Java, C++ etc)" and "Looking for C# HTML parser".

PS, how can I put a link to a * question in a comment?

#3

In plain English, your regex should match the opening quote of the src attribute and then capture every following character that is not a quote.

In perl regex, it would be like this:

/src=[\"\']([^\"\']+)/

The URL will be in $1 after running this.

Of course, this assumes that the urls in your src attributes are quoted. You can modify the values in the [] brackets accordingly if they are not.
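
Since the question is about C#, the same idea might look roughly like the sketch below. It keeps the imgSrc group name from the question; the class name ImgSrcRegex and the method FirstImgSrc are invented for illustration. Beyond the Perl pattern, it additionally anchors the match to an img tag and requires whitespace before src so that attributes such as data-src are skipped. Like the Perl version, it assumes the src value is quoted, and it inherits the usual limits of matching HTML with a regex.

```csharp
using System;
using System.Text.RegularExpressions;

public static class ImgSrcRegex
{
    // Matches an <img ...> tag and captures only the quoted src value
    // in the named group "imgSrc".
    private static readonly Regex SrcPattern = new Regex(
        @"<img\b[^>]*?\ssrc\s*=\s*[""'](?<imgSrc>[^""']+)[""']",
        RegexOptions.IgnoreCase);

    // Returns the src value of the first matching img tag, or null if none is found.
    public static string FirstImgSrc(string html)
    {
        Match m = SrcPattern.Match(html);
        return m.Success ? m.Groups["imgSrc"].Value : null;
    }

    public static void Main()
    {
        // Hypothetical sample input.
        string html = "<p><img alt='x' src=\"pics/cat.jpg\"></p><img src='dog.png'>";

        Console.WriteLine(FirstImgSrc(html)); // prints pics/cat.jpg
    }
}
```

Reading m.Groups["imgSrc"].Value instead of m.Value is what returns just the URL rather than the whole src="..." text.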
