How can I speed up XML::Twig?

Time: 2022-10-03 03:56:24

I am using XML::Twig to parse through a very large XML document. I want to split it into chunks based on the <change></change> tags.

Right now I have:

my $xml = XML::Twig->new(twig_handlers => { 'change' => \&parseChange, });
$xml->parsefile($LOGFILE);

sub parseChange {

  my ($xml, $change) = @_;

  my $message = $change->first_child('message');
  my @lines   = $message->children_text('line');

  foreach (@lines) {
    if ($_ =~ /[^a-zA-Z0-9](?i)bug(?-i)[^a-zA-Z0-9]/) {
      print outputData "$_\n";
    }
  }

  outputData->flush();
  $change->purge;
}

Right now this runs the parseChange handler each time it pulls a change block from the XML, and it is extremely slow. I tested it against reading the XML from a file with $/ = '</change>' and writing a function to return the contents of an XML tag, and that went much faster.
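
For comparison, the record-splitting approach described above might look roughly like the sketch below. It is a guess at what was tested, not the actual code: tag_contents and bug_lines are hypothetical names, and a regex-based extractor only copes with simple, non-nested, attribute-free elements like the ones in this log.

```perl
use strict;
use warnings;

# Hypothetical helper: return the text of every <$tag>...</$tag> pair in $chunk.
# This is not a real XML parser: no attributes, nesting, CDATA or entities.
sub tag_contents {
    my ($chunk, $tag) = @_;
    return $chunk =~ m{<\Q$tag\E>(.*?)</\Q$tag\E>}gs;
}

# Read one <change>...</change> record at a time from a filehandle and
# return the <line> texts inside <message> that mention "bug".
sub bug_lines {
    my ($fh) = @_;
    local $/ = '</change>';    # input record separator: one record per change
    my @hits;
    while (my $record = <$fh>) {
        my ($message) = tag_contents($record, 'message');
        next unless defined $message;
        push @hits, grep { /[^a-z0-9]bug[^a-z0-9]/i }
                    tag_contents($message, 'line');
    }
    return @hits;
}

# Usage with an in-memory file standing in for the real log
my $sample = <<'XML';
<change><message><line>Fix bug 123</line><line>Other</line></message></change>
<change><message><line>debugging notes</line></message></change>
XML
open my $fh, '<', \$sample or die "Cannot open in-memory file: $!";
print "$_\n" for bug_lines($fh);
```

This skips all XML validation, which is a large part of why it is faster; it also breaks as soon as the input stops being this regular.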

Is there something I'm missing or am I using XML::Twig incorrectly? I'm new to Perl.

EDIT: Here is an example change from the changes file. The file consists of a lot of these one right after the other and there should not be anything in between them:

<change>
<project>device_common</project>
<commit_hash>523e077fb8fe899680c33539155d935e0624e40a</commit_hash>
<tree_hash>598e7a1bd070f33b1f1f8c926047edde055094cf</tree_hash>      
<parent_hashes>71b1f9be815b72f925e66e866cb7afe9c5cd3239</parent_hashes>      
<author_name>Jean-Baptiste Queru</author_name>      
<author_e-mail>jbq@google.com</author_e-mail>      
<author_date>Fri Apr 22 08:32:04 2011 -0700</author_date>      
<commiter_name>Jean-Baptiste Queru</commiter_name>      
<commiter_email>jbq@google.com</commiter_email>      
<committer_date>Fri Apr 22 08:32:04 2011 -0700</committer_date>      
<subject>chmod the output scripts</subject>      
<message>         
    <line>Change-Id: Iae22c67066ba4160071aa2b30a5a1052b00a9d7f</line>      
</message>      
<target>         
    <line>generate-blob-scripts.sh</line>      
</target>   
</change>

5 solutions

#1


3  

As it stands, your program is processing all of the XML document, including the data outside the change elements that you aren't interested in.

If you change the twig_handlers parameter in your constructor to twig_roots, then the tree structures will be built for only the elements of interest and the rest will be ignored.

my $xml = XML::Twig->new(twig_roots => { change => \&parseChange });
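
As a fuller illustration, a handler wired up through twig_roots could be sketched as follows. The function name bug_lines_from_xml and the <changes> wrapper element in the sample are assumptions made for the example; for a real file, parsefile would replace parse.

```perl
use strict;
use warnings;
use XML::Twig;

# Collect the <line> texts under <message> that mention "bug", building a
# tree only for <change> elements (hypothetical helper for illustration).
sub bug_lines_from_xml {
    my ($xml_text) = @_;
    my @hits;
    my $twig = XML::Twig->new(
        twig_roots => {
            change => sub {
                my ($t, $change) = @_;
                my $message = $change->first_child('message');
                if ($message) {
                    push @hits, grep { /[^a-z0-9]bug[^a-z0-9]/i }
                                $message->children_text('line');
                }
                $t->purge;    # release everything parsed so far
            },
        },
    );
    $twig->parse($xml_text);
    return @hits;
}

# Usage: the <changes> root element is assumed, not taken from the question
my $doc = '<changes>'
        . '<change><message><line>fix bug #1</line><line>nothing</line></message></change>'
        . '<change><message><line>debugging</line></message></change>'
        . '</changes>';
print "$_\n" for bug_lines_from_xml($doc);
```

Calling purge in the handler keeps memory flat, since each <change> is discarded as soon as it has been inspected.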

#2


1  

XML::Twig includes a mechanism by which you can handle tags as they appear, then discard what you no longer need to free memory.

Here is an example taken from the documentation (which also has a lot more helpful information):

my $t= XML::Twig->new( twig_handlers => 
                          { section => \&section,
                            para   => sub { $_->set_tag( 'p'); }
                          },
                       );
  $t->parsefile( 'doc.xml');

  # the handler is called once a section is completely parsed, ie when 
  # the end tag for section is found, it receives the twig itself and
  # the element (including all its sub-elements) as arguments
  sub section 
    { my( $t, $section)= @_;      # arguments for all twig_handlers
      $section->set_tag( 'div');  # change the tag name.4, my favourite method...
      # let's use the attribute nb as a prefix to the title
      my $title= $section->first_child( 'title'); # find the title
      my $nb= $title->att( 'nb'); # get the attribute
      $title->prefix( "$nb - ");  # easy isn't it?
      $section->flush;            # outputs the section and frees memory
    }

This will probably be essential when working with a multi-gigabyte file, because (again, according to the documentation) storing the entire thing in memory can take as much as 10 times the size of the file.

Edit: A couple of comments based on your edited question. It is not clear exactly what is slowing you down without knowing more about your file structure, but here are a few things to try:

  • Flushing the output filehandle will slow you down if you are writing a lot of lines. Perl buffers file writes specifically for performance reasons, and you are bypassing that.
  • Instead of using the (?i) mechanism, a rather advanced feature that probably has a performance penalty, why not make the whole match case insensitive? /[^a-z0-9]bug[^a-z0-9]/i is equivalent. You also might be able to simplify it with /\bbug\b/i, which is nearly equivalent. The differences are that \b treats an underscore as a word character, so "_bug_" no longer matches, and that \b also matches at the very start or end of the string, where the character classes need an actual character.
  • There are a couple of other simplifications that can be made as well to remove intermediate steps.
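
To see where the three regex variants agree and differ, a small comparison along these lines may help. The helper name matching_patterns and the sample strings are made up for the illustration.

```perl
use strict;
use warnings;

# Return the names of the patterns that match $text (hypothetical helper).
sub matching_patterns {
    my ($text) = @_;
    my @patterns = (
        [ original   => qr/[^a-zA-Z0-9](?i)bug(?-i)[^a-zA-Z0-9]/ ],
        [ simplified => qr/[^a-z0-9]bug[^a-z0-9]/i ],
        [ boundary   => qr/\bbug\b/i ],
    );
    return map { $_->[0] } grep { $text =~ $_->[1] } @patterns;
}

for my $text ('Fix BUG 42', 'bug at line start', 'the_bug_list', 'debugging') {
    my $names = join(', ', matching_patterns($text)) || '(none)';
    printf "%-20s %s\n", "'$text'", $names;
}
```

With these inputs, 'Fix BUG 42' matches all three patterns, 'bug at line start' matches only the \b version, 'the_bug_list' matches only the two character-class versions, and 'debugging' matches none.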

How does this handler code compare to yours speed-wise?

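sub parseChange
{
    my ($xml, $change) = @_;

    # children_text('line') returns one string per <line>, so each line is
    # tested on its own; first_child_text('message') would return the whole
    # message as a single string
    my $message = $change->first_child('message');
    my @lines   = $message ? $message->children_text('line') : ();

    foreach (grep /[^a-z0-9]bug[^a-z0-9]/i, @lines)
    {
        print outputData "$_\n";
    }

    $change->purge;
}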

#3


0  

If your XML is really big, use XML::SAX. It doesn't have to load the entire data set into memory; instead, it reads the file sequentially and generates callback events for every tag. I have successfully used XML::SAX to parse XML files larger than 1 GB. Here is an example of an XML::SAX handler for your data:

#!/usr/bin/env perl
package Change::Extractor;
use 5.010;
use strict;
use warnings qw(all);

use base qw(XML::SAX::Base);

sub new {
    bless { data => '', path => [] }, shift;
}

sub start_element {
    my ($self, $el) = @_;
    $self->{data} = '';
    push @{$self->{path}} => $el->{Name};
}

sub end_element {
    my ($self, $el) = @_;
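    # compare the element path as a string; the smartmatch operator (~~)
    # is deprecated in recent versions of Perl
    if (join('/', @{ $self->{path} }) eq 'change/message/line') {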
        say $self->{data};
    }
    pop @{$self->{path}};
}

sub characters {
    my ($self, $data) = @_;
    $self->{data} .= $data->{Data};
}

1;

package main;
use strict;
use warnings qw(all);

use XML::SAX::PurePerl;

my $handler = Change::Extractor->new;
my $parser = XML::SAX::PurePerl->new(Handler => $handler);

$parser->parse_file(\*DATA);

__DATA__
<?xml version="1.0"?>
<change>
  <project>device_common</project>
  <commit_hash>523e077fb8fe899680c33539155d935e0624e40a</commit_hash>
  <tree_hash>598e7a1bd070f33b1f1f8c926047edde055094cf</tree_hash>
  <parent_hashes>71b1f9be815b72f925e66e866cb7afe9c5cd3239</parent_hashes>
  <author_name>Jean-Baptiste Queru</author_name>
  <author_e-mail>jbq@google.com</author_e-mail>
  <author_date>Fri Apr 22 08:32:04 2011 -0700</author_date>
  <commiter_name>Jean-Baptiste Queru</commiter_name>
  <commiter_email>jbq@google.com</commiter_email>
  <committer_date>Fri Apr 22 08:32:04 2011 -0700</committer_date>
  <subject>chmod the output scripts</subject>
  <message>
    <line>Change-Id: Iae22c67066ba4160071aa2b30a5a1052b00a9d7f</line>
  </message>
  <target>
    <line>generate-blob-scripts.sh</line>
  </target>
</change>

Outputs

Change-Id: Iae22c67066ba4160071aa2b30a5a1052b00a9d7f

#4


0  

Not an XML::Twig answer, but ...

If you're going to extract stuff from XML files, you might want to consider XSLT. Using xsltproc and the following XSL stylesheet, I got the bug-containing change lines out of 1 GB of <change>s in about a minute. Lots of improvements possible, I'm sure.

<?xml version="1.0"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0" >

  <xsl:output method="text"/>
  <xsl:variable name="lowercase" select="'abcdefghijklmnopqrstuvwxyz'" />
  <xsl:variable name="uppercase" select="'ABCDEFGHIJKLMNOPQRSTUVWXYZ'" />

  <xsl:template match="/">
    <xsl:apply-templates select="changes/change/message/line"/>
  </xsl:template>

  <xsl:template match="line">
    <xsl:variable name="lower" select="translate(.,$uppercase,$lowercase)" />
    <xsl:if test="contains($lower,'bug')">
      <xsl:value-of select="."/>
      <xsl:text>
</xsl:text>
    </xsl:if>
  </xsl:template>
</xsl:stylesheet>

If your XML processing can be done as

  1. extract to plain text
  2. wrangle flattened text
  3. profit

then XSLT may be the tool for the first step in that process.

#5


0  

Mine's taking a horrifically long time.

my $twig = XML::Twig->new(
    twig_handlers => {
        SchoolInfo => \&schoolinfo,
    },
    pretty_print => 'indented',
);

$twig->parsefile('data/SchoolInfos.2018-04-17.xml');

sub schoolinfo {
    my ($twig, $l) = @_;
    my $rec = {
        name  => $l->field('SchoolName'),
        refid => $l->{'att'}->{RefId},
        phone => $l->field('SchoolPhoneNumber'),
    };

    for my $node ($l->findnodes('//Street'))    { $rec->{street}   = $node->text; }
    for my $node ($l->findnodes('//Town'))      { $rec->{city}     = $node->text; }
    for my $node ($l->findnodes('//PostCode'))  { $rec->{postcode} = $node->text; }
    for my $node ($l->findnodes('//Latitude'))  { $rec->{lat}      = $node->text; }
    for my $node ($l->findnodes('//Longitude')) { $rec->{lng}      = $node->text; }
}

Is it the pretty_print perchance? Otherwise it's pretty straightforward.
