How can I extract URLs and link text from HTML in Perl?

Date: 2022-05-30 07:33:06

I previously asked how to do this in Groovy. However, now I'm rewriting my app in Perl because of all the CPAN libraries.

If the page contained these links:

<a href="http://www.google.com">Google</a>

<a href="http://www.apple.com">Apple</a>

The output would be:

Google, http://www.google.com
Apple, http://www.apple.com

What is the best way to do this in Perl?

11 Solutions

#1 (score 39)

Please look at using the WWW::Mechanize module for this. It will fetch your web pages for you, and then give you easy-to-work-with lists of URLs.

use strict;
use warnings;
use WWW::Mechanize;

my $some_url = shift @ARGV;    # URL of the page to scan

my $mech = WWW::Mechanize->new();
$mech->get( $some_url );
my @links = $mech->links();
for my $link ( @links ) {
    printf "%s, %s\n", $link->text, $link->url;
}

Pretty simple, and if you're looking to navigate to other URLs on that page, it's even simpler.

Mech is basically a browser in an object.

#2 (score 11)

Have a look at HTML::LinkExtractor and HTML::LinkExtor, part of the HTML::Parser package.

HTML::LinkExtractor is similar to HTML::LinkExtor, except that besides getting the URL, you also get the link-text.

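For reference, a minimal HTML::LinkExtor sketch might look like the following. Note that it only reports tag/attribute/URL triples, not the link text:

use strict;
use warnings;
use HTML::LinkExtor;

my $html = do { local $/; <DATA> };   # slurp the sample page

my $parser = HTML::LinkExtor->new;
$parser->parse($html);
$parser->eof;

# Each link is an arrayref: [ $tag, $attr => $url, ... ]
for my $link ( $parser->links ) {
    my ( $tag, %attrs ) = @$link;
    print "$tag: $attrs{href}\n" if $attrs{href};
}

__DATA__
<a href="http://www.google.com">Google</a>
<a href="http://www.apple.com">Apple</a>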

#3 (score 6)

If you're adventurous and want to try without modules, something like this should work (adapt it to your needs):

#!/usr/bin/perl
use strict;
use warnings;

if ( $#ARGV < 0 ) {
    print "$0: Need URL argument.\n";
    exit 1;
}

# Fetch the page with wget and keep only the lines containing anchors.
my @content = split( /\n/, `wget -qO- $ARGV[0]` );
my @links   = grep( /<a.*href=.*>/, @content );

foreach my $c (@links) {
    my ($link)  = $c =~ /<a[^>]*href="([^"]+)"[^>]*>/;
    my ($title) = $c =~ /<a[^>]*>([\s\S]+?)<\/a>/;
    print "$title, $link\n" if defined $link && defined $title;
}

There are likely a few things I did wrong here, but it works in the handful of test cases I tried after writing it (it doesn't account for things like <img> tags, etc.).

#4 (score 6)

I like using pQuery for things like this...

use strict;
use warnings;
use feature 'say';    # say() needs Perl 5.10+
use pQuery;

pQuery( 'http://www.perlbuzz.com' )->find( 'a' )->each(
    sub {
        say $_->innerHTML . q{, } . $_->getAttribute( 'href' );
    }
);

Also check out this previous *.com question, "Emulation of lex like functionality in Perl or Python", for similar answers.

#5 (score 5)

Another way to do this is to use XPath to query the parsed HTML. It's needed in complex cases, such as extracting all the links inside a div with a specific class. Use HTML::TreeBuilder::XPath for this:

use HTML::TreeBuilder::XPath;

my $tree  = HTML::TreeBuilder::XPath->new_from_content($html);  # $html: page source
my $nodes = $tree->findnodes(q{//map[@name='map1']/area});
while (my $node = $nodes->shift) {
    my $title = $node->attr('title');
    # ... use $title ...
}
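
And a rough sketch of the "links inside a div with a specific class" case mentioned above (the nav class name is a placeholder, and $html is again assumed to hold the page source):

use strict;
use warnings;
use HTML::TreeBuilder::XPath;

my $tree = HTML::TreeBuilder::XPath->new_from_content($html);
# "nav" is a placeholder class name; adjust the XPath to your page.
for my $node ( $tree->findnodes(q{//div[@class='nav']//a[@href]}) ) {
    printf "%s, %s\n", $node->as_text, $node->attr('href');
}
$tree->delete;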

#6 (score 4)

Sherm recommended HTML::LinkExtor, which is almost what you want. Unfortunately, it can't return the text inside the <a> tag.

Andy recommended WWW::Mechanize. That's probably the best solution.

If you find that WWW::Mechanize isn't to your liking, try HTML::TreeBuilder. It builds a DOM-like tree out of the HTML, which you can then search for the links you want, extracting any nearby content as you go.

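A minimal HTML::TreeBuilder sketch of that approach, assuming $html holds the page source:

use strict;
use warnings;
use HTML::TreeBuilder;

my $tree = HTML::TreeBuilder->new_from_content($html);  # $html: page source
for my $a ( $tree->look_down( _tag => 'a' ) ) {
    next unless defined $a->attr('href');               # skip anchors without href
    printf "%s, %s\n", $a->as_text, $a->attr('href');
}
$tree->delete;  # free the tree when done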

#7 (score 4)

Or consider enhancing HTML::LinkExtor to do what you want, and submitting the changes to the author.

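As a sketch of the kind of extension meant here, you can also drive the HTML::Parser event API directly to pair each href with its link text:

use strict;
use warnings;
use HTML::Parser;

my ( @links, $cur );
my $p = HTML::Parser->new(
    api_version => 3,
    # Start collecting when an <a href="..."> opens.
    start_h => [ sub {
        my ( $tag, $attr ) = @_;
        $cur = { href => $attr->{href}, text => '' }
            if $tag eq 'a' && defined $attr->{href};
    }, 'tagname, attr' ],
    # Accumulate the decoded text inside the anchor.
    text_h  => [ sub { $cur->{text} .= shift if $cur }, 'dtext' ],
    # Save the pair when the </a> closes.
    end_h   => [ sub { push @links, $cur if shift eq 'a' && $cur; undef $cur },
                 'tagname' ],
);
$p->parse('<a href="http://www.google.com">Google</a>');
$p->eof;
printf "%s, %s\n", $_->{text}, $_->{href} for @links;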

#8 (score 4)

Previous answers were perfectly good and I know I’m late to the party but this got bumped in the [perl] feed so…

XML::LibXML is excellent for HTML parsing and unbeatable for speed. Set the recover option when parsing badly formed HTML.

use strict;
use warnings;
use XML::LibXML;

# recover => 1 makes the parser carry on over badly formed HTML.
my $doc = XML::LibXML->load_html(IO => \*DATA, recover => 1);
for my $anchor ( $doc->findnodes('//a[@href]') )
{
    printf "%15s -> %s\n",
        $anchor->textContent,
        $anchor->getAttribute("href");
}

__DATA__
<html><head><title/></head><body>
<a href="http://www.google.com">Google</a>
<a href="http://www.apple.com">Apple</a>
</body></html>

This yields:

     Google -> http://www.google.com
      Apple -> http://www.apple.com

#9 (score 3)

HTML::LinkExtractor is better than HTML::LinkExtor

It can give you both the link text and the URL.

Usage:

use strict;
use warnings;
use HTML::LinkExtractor;

my $input = q{If <a href="http://apple.com/"> Apple </a>};  # HTML string
my $LX = HTML::LinkExtractor->new( undef, undef, 1 );
$LX->parse( \$input );
for my $Link ( @{ $LX->links } ) {
    if ( $$Link{_TEXT} =~ m/Apple/ ) {
        print "\n LinkText $$Link{_TEXT} URL $$Link{href}\n";
    }
}

#10 (score 2)

HTML is a structured markup language that has to be parsed to extract its meaning without errors. The module Sherm listed will parse the HTML and extract the links for you. Ad hoc regular expression-based solutions might be acceptable if you know that your inputs will always be formed the same way (don't forget attributes), but a parser is almost always the right answer for processing structured text.

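For instance, a naive pattern such as /<a href="([^"]+)"/ misses a link as soon as another attribute precedes href, while a parser extracts it regardless; a small sketch:

use strict;
use warnings;
use HTML::LinkExtor;

# href is not the first attribute here, which defeats naive patterns
# but is irrelevant to a real parser.
my $html = '<a rel="nofollow" class="ext" href="http://example.com/">Example</a>';

my $p = HTML::LinkExtor->new;
$p->parse($html);
$p->eof;

for my $link ( $p->links ) {
    my ( $tag, %attrs ) = @$link;
    print "$attrs{href}\n" if $attrs{href};   # prints http://example.com/
}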

#11 (score -1)

We can use a regular expression to extract the link along with its link text. This is one more way to do it.

use strict;
use warnings;

my $html = do { local $/; <DATA> };   # undef $/ slurps the whole DATA section

while ( $html =~ m/<a[^>]*?href="([^>]*?)"[^>]*?>\s*([\w\W]*?)\s*<\/a>/igs ) {
    print "Link:$1 \t Text: $2\n";
}


__DATA__

<a href="http://www.google.com">Google</a>

<a href="http://www.apple.com">Apple</a>
