Regular expression to parse links out of HTML code

Time: 2022-10-29 16:11:22

Possible Duplicate:
Regex to get the link in href. [asp.net]

I'm working on a method that accepts a string (HTML code) and returns an array containing all the links found in it.

I've seen a few options, such as the HTML Agility Pack, but they seem a little more complicated than this project calls for.

I'm also interested in using regular expressions because I don't have much experience with them in general, and I think this would be a good learning opportunity.

My code thus far is:

    WebClient client = new WebClient();
    string htmlCode = client.DownloadString(p);
    Regex exp = new Regex(@"http://(www\.)?([^\.]+)\.com", RegexOptions.IgnoreCase);
    string[] test = exp.Split(htmlCode);

But I'm not getting the results I want, because I'm still working on the regular expression.

Pseudocode for what I'm looking for is "

4 answers

#1


3  

If you are looking for a foolproof solution, regular expressions are not the answer. They are fundamentally limited and cannot be used to reliably parse links, or any other tags for that matter, out of an HTML file, because of the complexity of the HTML language.

Instead, you'll need to use an actual HTML DOM API to parse out the links.

#2


2  

Regular Expressions are not the best idea for HTML.

See previous questions:

Rather, you want something that already knows how to parse the DOM; otherwise, you're re-inventing the wheel.

#3


2  

Other users may tell you "No, Stop! Regular expressions should not mix with HTML! It's like mixing bleach and ammonia!". There is a lot of wisdom in that advice, but it's not the full story.

The truth is that regular expressions work just fine for collecting commonly formatted links. However, a better approach would be to use a dedicated tool for this type of thing, such as the HtmlAgilityPack.

If you use regular expressions, you may match 99.9% of the links, but you may miss rare, unanticipated corner cases or malformed HTML.
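To make the corner-case point concrete, here is a minimal sketch. The class name, method name, pattern, and sample markup are all made up for illustration. It shows a pattern that only understands quoted attribute values silently skipping an unquoted href, which browsers and HTML parsers generally accept:

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class CornerCaseDemo
{
    // Naive pattern (an assumption for this sketch): only matches href values
    // wrapped in single or double quotes.
    private static readonly Regex QuotedHref = new Regex(
        @"href=[""'](?<url>[^""']+)[""']",
        RegexOptions.IgnoreCase);

    // Returns every href value the naive pattern manages to find.
    public static List<string> QuotedHrefs(string html)
    {
        var found = new List<string>();
        foreach (Match m in QuotedHref.Matches(html))
        {
            found.Add(m.Groups["url"].Value);
        }
        return found;
    }

    public static void Main()
    {
        // Hypothetical markup: the second link has no quotes around its href.
        const string html =
            "<a href=\"http://example.com/a\">quoted</a>" +
            "<a href=http://example.com/b>unquoted</a>";

        // Only the quoted link is reported; the unquoted one is missed.
        foreach (string url in QuotedHrefs(html))
        {
            Console.WriteLine(url);
        }
    }
}
```

You can keep extending the pattern for each case you discover, but a DOM parser already handles them.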

Here's a function I put together that uses the HtmlAgilityPack to meet your requirements:

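    // Requires: using System.Collections.Generic; using System.Linq; using HtmlAgilityPack;
    private static IEnumerable<string> DocumentLinks(string sourceHtml)
    {
        HtmlDocument sourceDocument = new HtmlDocument();

        sourceDocument.LoadHtml(sourceHtml);

        // SelectNodes can return null when nothing matches, so guard against that.
        HtmlNodeCollection anchors = sourceDocument.DocumentNode
            .SelectNodes("//a[@href!='#']");

        if (anchors == null)
        {
            return Enumerable.Empty<string>();
        }

        // Project each <a> node to the value of its href attribute.
        return anchors.Select(n => n.GetAttributeValue("href", ""));
    }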

This function creates a new HtmlAgilityPack.HtmlDocument, loads a string containing HTML into it, and then uses the XPath query "//a[@href!='#']" to select all of the links on the page that do not point to "#". Then I use the LINQ extension method Select to convert the HtmlNodeCollection into a sequence of strings containing the value of each href attribute, i.e. where each link points.

Here's an example of its use:

        List<string> links = 
            DocumentLinks((new WebClient())
                .DownloadString("http://google.com")).ToList();

        Debugger.Break();

This should be a lot more effective than regular expressions.

#4


0  

You could look for anything that looks roughly like a URL with the http or https scheme. This is not robust against arbitrary HTML, but it will get you things that look like HTTP URLs, which I suspect is what you need. You can add more schemes and domains.
The regex looks for things that look like a URL "in" an href attribute (not strictly).

using System;
using System.Text.RegularExpressions;

class Program {
    static void Main(string[] args) {
        // The named group "url" captures the scheme and host only; any path is matched but not captured.
        const string pattern = @"href=[""'](?<url>(http|https)://[^/]*?\.(com|org|net|gov))(/.*)?[""']";
        var regex = new Regex(pattern);
        var urls = new string[] {
            "href='http://company.com'",
            "href=\"https://company.com\"",
            "href='http://company.org'",
            "href='http://company.org/'",
            "href='http://company.org/path'",
        };

        foreach (var url in urls) {
            Match match = regex.Match(url);
            if (match.Success) {
                Console.WriteLine("{0} -> {1}", url, match.Groups["url"].Value);
            }
        }
    }
}

Output:

href='http://company.com' -> http://company.com
href="https://company.com" -> https://company.com
href='http://company.org' -> http://company.org
href='http://company.org/' -> http://company.org
href='http://company.org/path' -> http://company.org
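To get from single attribute strings to what the question asks for (an array of every link in an HTML string), a sketch along these lines might work. Note that Regex.Split returns the text between matches, so Regex.Matches is the call to use for collecting the links themselves. The class and method names here are hypothetical, and the pattern is an assumed variation on the one above: it captures the path as well as the host, drops the top-level-domain list, and uses [^"'] instead of . so that a match cannot run past the closing quote when several links share a line.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class LinkExtractor
{
    // Assumed variation on the pattern above; not a complete URL grammar.
    private static readonly Regex HrefPattern = new Regex(
        @"href=[""'](?<url>https?://[^/""']+(/[^""']*)?)[""']",
        RegexOptions.IgnoreCase);

    // Collects the "url" group of every match in the document.
    public static string[] GetLinks(string html)
    {
        return HrefPattern.Matches(html)
            .Cast<Match>()
            .Select(m => m.Groups["url"].Value)
            .ToArray();
    }

    public static void Main()
    {
        // Hypothetical markup with two links on the same line.
        const string html =
            "<p><a href='http://company.com'>one</a> " +
            "<a href=\"https://company.org/path\">two</a></p>";

        foreach (string link in GetLinks(html))
        {
            Console.WriteLine(link);
        }
    }
}
```

Like the original pattern, this is not HTML-aware: it will miss unquoted or relative hrefs and will also pick up href-like text inside comments or scripts.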
