A little help with Perl HTML parsing

I am working on a small Perl program that will open a site, search for the words Hail Reports, and give me back the information. I am very new to Perl, so some of this might be simple to fix. First, my code says I am using an uninitialized value. Here is what I have:

#!/usr/bin/perl -w
use LWP::Simple;

my $html = get("http://www.spc.noaa.gov/climo/reports/last3hours.html")
    or die "Could not fetch NWS page.";
$html =~ m{Hail Reports} || die;
my $hail = $1;
print "$hail\n";

Secondly, I thought regular expressions would be the easiest way to do what I want, but I am not sure whether they can do it. I want my program to search for the words Hail Reports and give me back the information between Hail Reports and the words Wind Reports. Is this possible with regular expressions, or should I be using a different method? Here is a snippet of the web page's source code that I want it to send back:

     <tr><th colspan="8">Hail Reports (<a href="last3hours_hail.csv">CSV</a>)&nbsp;(<a href="last3hours_raw_hail.csv">Raw Hail CSV</a>)(<a href="/faq/#6.10">?</a>)</th></tr> 

#The Data here will change throughout the day so normally there will be more info.
      <tr><td colspan="8" class="highlight" align="center">No reports received</td></tr> 
      <tr><th colspan="8">Wind Reports (<a href="last3hours_wind.csv">CSV</a>)&nbsp;(<a href="last3hours_raw_wind.csv">Raw Wind CSV</a>)(<a href="/faq/#6.10">?</a>)</th></tr>

4 Solutions

#1

You were capturing nothing in $1 because none of your regex was enclosed in parentheses. The following works for me.

#!/usr/bin/perl
use strict;
use warnings;

use LWP::Simple;

my $html = get("http://www.spc.noaa.gov/climo/reports/last3hours.html")
    or die "Could not fetch NWS page.";

$html =~ m{Hail Reports(.*)Wind Reports}s || die; #Parentheses indicate capture group
my $hail = $1; # $1 contains whatever matched in the (.*) part of above regex
print "$hail\n";
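To see the capture-group point without fetching the page, here is a minimal offline sketch. The sample string and the two helper subs (between_markers, capture_without_parens) are made up for illustration and are not part of the original answer.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical stand-in for the fetched page (not the real NOAA markup).
my $sample = "Hail Reports\nNo reports received\nWind Reports\n";

# With parentheses: returns the text between the two markers, or undef.
sub between_markers {
    my ($text) = @_;
    return $text =~ m{Hail Reports(.*)Wind Reports}s ? $1 : undef;
}

# Without parentheses: the match can succeed, but $1 stays undefined.
sub capture_without_parens {
    my ($text) = @_;
    return $text =~ m{Hail Reports} ? $1 : 'no match';
}

my $with    = between_markers($sample);
my $without = capture_without_parens($sample);

print "with parentheses:    ", (defined $with    ? "[$with]"    : "undef"), "\n";
print "without parentheses: ", (defined $without ? "[$without]" : "undef"), "\n";
```

The second sub is the situation in the question: the pattern matches, yet printing $1 triggers the uninitialized-value warning.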

#2

The uninitialized-value warning is coming from $1 -- it's not defined or set anywhere.

For a line-level instead of byte-level "between" you could use:

for (split(/\n/, $html)) {
    # split removes the newlines, so add one back when printing
    print "$_\n" if (/Hail Reports/ .. /Wind Reports/ and !/(?:Hail|Wind) Reports/);
}
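The range operator in the loop above can be tried on a few made-up lines. This is only a sketch: the sample text and the lines_between helper are assumptions for illustration, not the real page content.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical sample lines standing in for the fetched page.
my $sample = join "\n",
    'header row',
    'Hail Reports (CSV)',
    'No reports received',
    'Wind Reports (CSV)',
    'footer row';

# Collect the lines strictly between the two marker lines.
sub lines_between {
    my ($text) = @_;
    my @kept;
    for (split /\n/, $text) {
        # In scalar context, .. turns true on the line matching the left
        # pattern and stays true through the line matching the right one.
        # The second test drops the two marker lines themselves.
        push @kept, $_
            if (/Hail Reports/ .. /Wind Reports/) and !/(?:Hail|Wind) Reports/;
    }
    return @kept;
}

print "$_\n" for lines_between($sample);
```

With this sample the only line printed is "No reports received".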

#3

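This uses the /s and /m modifiers together with a non-greedy quantifier (/m has no effect here, because the pattern contains no ^ or $ anchors). The non-greedy match stops at the first "Wind Reports" after "Hail Reports", which should be a little faster than a greedy match.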

#!/usr/bin/perl -w

use strict;
use LWP::Simple;

   sub main{
      my $html = get("http://www.spc.noaa.gov/climo/reports/last3hours.html")
                 or die "Could not fetch NWS page.";

      # match single and multiple lines + not greedy
      my ($hail, $between, $wind) = $html =~ m/(Hail Reports)(.*?)(Wind Reports)/sm
                 or die "No Hail/Wind Reports";

      print qq{
               Hail:         $hail
               Wind:         $wind
               Between Text: $between
            };
   }

   main();
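The difference between greedy and non-greedy matching only shows up when the closing marker occurs more than once. A small sketch on a made-up string (the sample text and the helper names greedy and non_greedy are assumptions for illustration):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical text in which the closing marker appears twice.
my $sample = "Hail Reports A Wind Reports B Wind Reports C";

# Greedy: .* runs as far as it can, so the capture ends at the LAST marker.
sub greedy {
    my ($text) = @_;
    return $text =~ /Hail Reports(.*)Wind Reports/s ? $1 : undef;
}

# Non-greedy: .*? stops as early as it can, at the FIRST marker.
sub non_greedy {
    my ($text) = @_;
    return $text =~ /Hail Reports(.*?)Wind Reports/s ? $1 : undef;
}

print "greedy:     [", greedy($sample), "]\n";
print "non-greedy: [", non_greedy($sample), "]\n";
```

On this string the greedy version captures " A Wind Reports B " and the non-greedy one captures " A ". If "Wind Reports" appears only once in the page, both return the same text.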

#4

Parentheses capture strings in regular expressions. Your regex has no parentheses, so $1 is not set to anything. If you had:

$html =~ m{(Hail Reports)} || die;

Then $1 would be set to "Hail Reports" if that text exists in $html. Since you only want to know whether it matched, you don't need to capture anything at this point, and you could write something like:

unless ( $html =~ /Hail Reports/ ) {
  die "No Hail Reports in HTML";
}

To capture something between the strings you can do something like:

if ( $html =~ /(?<=Hail Reports)(.*?)(?=Wind Reports)/s ) {
  print "Got $1\n";
}
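What the lookbehind and lookahead add is that the marker text is checked but is not part of the match itself. A rough sketch on a made-up string, using the /p modifier and ${^MATCH} (available since Perl 5.10) to show the whole matched text; the sample and the two helper subs are assumptions for illustration:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical sample text (not the real page).
my $sample = "x Hail Reports A Wind Reports y";

# Plain pattern: the overall match includes both markers.
sub match_with_markers {
    my ($text) = @_;
    return $text =~ /Hail Reports.*?Wind Reports/sp ? ${^MATCH} : undef;
}

# Lookaround pattern: the markers must be present but are not consumed,
# so the overall match is only the text between them.
sub match_without_markers {
    my ($text) = @_;
    return $text =~ /(?<=Hail Reports).*?(?=Wind Reports)/sp ? ${^MATCH} : undef;
}

print "plain:      [", match_with_markers($sample), "]\n";
print "lookaround: [", match_without_markers($sample), "]\n";
```

For simply pulling out the text between the markers, an ordinary capture group like (.*?) between the two literal strings gives the same $1, so the lookarounds are optional here.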
