Regex to parse html for sentences?

Time: 2022-10-29 20:38:56

I know that HTML::Parser exists, and from reading around I've realized that parsing HTML with regex is usually a suboptimal approach. For a Perl class, though, I'm trying to use regular expressions (ideally a single match) to identify and store the sentences from a saved HTML document. Eventually I want to calculate the number of sentences, the words per sentence, and the average word length on the page.

For now, I've just tried to isolate things which follow ">" and precede a ". " just to see what if anything it isolates, but I can't get the code to run, even when manipulating the regular expression. So I'm not sure if the issue is in the regex, somewhere else or both. Any help would be appreciated!

#!/usr/bin/perl
#new
use CGI qw(:standard);
print header;

open FILE, "< sample.html ";
$html = join('', <FILE>);
close FILE;

print "<pre>";

###Main Program###
&sentences;

###sentence identifier sub###

sub sentences {
@sentences;
while ($html =~ />[^<]\. /gis) {
    push @sentences, $1;
}
#for debugging, comment out when running    
    print join("\n",@sentences);
}

print "</pre>";

3 solutions

#1


3  

Your regex should be />[^<]*?\./gis

The *? means match zero or more, non-greedily. As it stood, your regex would match only a single non-< character followed by a period and a space. This way it will match all non-< characters up to the first period.
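To make the difference concrete, here is a minimal sketch comparing the two patterns. The sample string is made up for illustration, and the capture group is added only so the matched text can be printed.

```perl
use strict;
use warnings;

# Hypothetical sample input, not from the original post.
my $html = '<p>Perl is fun. Regexes are terse.</p>';

# Original pattern: exactly one non-'<' character between '>' and '. '.
my $matched_original = $html =~ />[^<]\. /s ? 1 : 0;

# With the lazy quantifier: any run of non-'<' characters up to the first period.
# The parentheses are an addition so the text can be read back from the match.
my ($first) = $html =~ />([^<]*?)\./s;

print "original pattern matched: $matched_original\n";
print "lazy pattern captured: $first\n";
```

On this input the original pattern finds nothing, because no '>' is followed by a single character and then a period and a space.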

There may be other problems.

Now read this

#2


2  

A first improvement would be to write $html =~ />([^<.]+)\. /gs: you need to capture the match with parens, and to allow more than one letter per sentence ;--)

This does not get all the sentences though, just the first one in each element.
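A small sketch of that limitation, using a made-up input with two sentences in the first element:

```perl
use strict;
use warnings;

# Hypothetical sample input, not from the original post.
my $html = '<p>First one. Second one. </p><p>Third one. </p>';

# The pattern must start at a '>', so only the sentence directly
# after a tag can match; later sentences in the same element are skipped.
my @sentences;
while ( $html =~ />([^<.]+)\. /gs ) {
    push @sentences, $1;
}

print "$_\n" for @sentences;
```

Here "Second one" is never captured, since it is preceded by a space rather than by '>'.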

A better way would be to capture all the text, then extract sentences from each fragment:

while( $html=~ m{>([^<]*)<}g) { push @text_content, $1; }
foreach (@text_content) { while( m{([^.]*)\.}gs) { push @sentences, $1; } }

(untested because it's early in the morning and coffee is calling)
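For what it's worth, here is a self-contained sketch of the same two-stage idea run against a made-up document. The whitespace trimming and the skipping of empty results are additions for tidier output, not part of the snippet above.

```perl
use strict;
use warnings;

# Hypothetical sample input, not from the original post.
my $html = '<html><body><p>First one. Second one.</p><p>Third one.</p></body></html>';

# Stage 1: collect whatever sits between a '>' and the next '<'.
# Adjacent tags produce empty fragments, which stage 2 ignores.
my @text_content;
while ( $html =~ m{>([^<]*)<}g ) {
    push @text_content, $1;
}

# Stage 2: split each fragment into period-terminated sentences.
my @sentences;
for my $fragment (@text_content) {
    while ( $fragment =~ m{([^.]*)\.}gs ) {
        my $sentence = $1;
        $sentence =~ s/^\s+|\s+$//g;    # trim; an addition for readability
        push @sentences, $sentence if length $sentence;
    }
}

print "$_\n" for @sentences;
```

Unlike the single-pattern version, this picks up every period-terminated sentence in each text fragment, though abbreviations and decimals would still be split incorrectly.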

All the usual caveats about parsing HTML with regexps apply, most notably the presence of '>' in the text.

#3


0  

I think this does more or less what you need. Keep in mind that this script only looks at text inside p tags. The file name is passed in as a command line argument (shift).

#!/usr/bin/perl

use strict;
use warnings;
use HTML::Grabber;

# The file name is the first command-line argument.
my $file_location = shift or die "usage: $0 file.html\n";
print "\n\nfile: $file_location";

my $totalWordCount = 0;
my $sentenceCount  = 0;
my $char_count     = 0;

# Slurp the whole file into one string.
open( my $file, '<', $file_location ) or die "cannot open $file_location: $!";
my $contents = do { local $/; <$file> };
close($file);

my $dom = HTML::Grabber->new( html => $contents );

$dom->find('p')->each( sub {
    my $p_tag = $_->text;

    # Count words and sentence terminators before touching the string.
    ++$totalWordCount while $p_tag =~ /\S+/g;
    ++$sentenceCount  while $p_tag =~ /[.!?]+/g;

    # Count non-whitespace characters once per paragraph, on a copy,
    # so the total is not added again for every sentence.
    ( my $compact = $p_tag ) =~ s/\s//g;
    $char_count += length($compact);
});

print "\n Total Words: $totalWordCount\n";
print " Total Sentences: $sentenceCount\n";
print " Total Characters: $char_count.\n";

# Guard against division by zero on pages with no text in p tags.
if ( $sentenceCount and $totalWordCount ) {
    printf " Average words per sentence: %.2f.\n", $totalWordCount / $sentenceCount;
    printf " Average characters per word: %.2f.\n\n", $char_count / $totalWordCount;
}
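If you only need the statistics from the question (sentence count, words per sentence, average word length), the counting can be separated from the HTML handling. The following is a rough sketch using core Perl only; text_stats is a hypothetical helper name, the word and sentence definitions are simplifying assumptions, and it expects text whose tags have already been removed.

```perl
use strict;
use warnings;

# text_stats is a hypothetical helper, not part of the answer above.
# It takes plain text (tags already removed) and returns a hash of counts.
sub text_stats {
    my ($text) = @_;

    # Words: runs of letters, digits, or apostrophes (an assumption).
    my @words = $text =~ /[A-Za-z0-9']+/g;

    # Sentences: number of runs of terminating punctuation.
    my $sentences = () = $text =~ /[.!?]+/g;

    # Total characters across all words, for the average word length.
    my $letters = 0;
    $letters += length for @words;

    return (
        words              => scalar @words,
        sentences          => $sentences,
        words_per_sentence => $sentences ? @words / $sentences : 0,
        avg_word_length    => @words     ? $letters / @words   : 0,
    );
}

my %stats = text_stats('Perl is fun. Is it not? Yes!');
printf "%d words, %d sentences, %.2f words per sentence, %.2f letters per word\n",
    @stats{qw(words sentences words_per_sentence avg_word_length)};
```

Because punctuation is excluded from the word pattern, the average word length here counts only word characters, which differs from a count of all non-whitespace characters.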
