How should I parse large XML files in Perl?

Date: 2023-01-15 10:19:01

Does reading XML data as in the following code create a DOM tree in memory?

use XML::Simple;

my $xml  = XML::Simple->new;
my $data = $xml->XMLin($blast_output, ForceArray => 1);

For large XML files should I use a SAX parser, with handlers, etc.?

3 answers

#1


4  

I would say yes to both. XML::Simple builds the entire tree in memory, and that tree takes a large multiple of the file's size. For many applications, once your XML is over 100MB or so, it becomes practically impossible to load it entirely into memory in Perl. A SAX parser instead delivers "events" (notifications) as the file is read and tags are opened or closed.

Depending on your usage pattern, either a SAX or a DOM-based parser could be faster. If you handle the file in a single pass, whether you need just a few nodes or every node, SAX mode is probably best; for example, reading a large RSS feed and parsing every item in it.

On the other hand, if you need to cross-reference one part of the file with another part, a DOM parser or accessing via XPath will make more sense - writing it in the "inside-out" manner that a SAX parser requires will be clumsy and tricky.
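
As a rough illustration of the cross-referencing case, here is a minimal sketch using XML::LibXML in DOM mode with XPath. The `shop`/`customer`/`order` document is made up for the example, and it assumes the whole file fits in memory.

```perl
use strict;
use warnings;
use XML::LibXML;

# Hypothetical sample data: each order refers to a customer by id.
my $xml = <<'XML';
<shop>
  <customers>
    <customer id="c1"><name>Alice</name></customer>
    <customer id="c2"><name>Bob</name></customer>
  </customers>
  <orders>
    <order customer="c2" total="10"/>
    <order customer="c1" total="25"/>
  </orders>
</shop>
XML

# load_xml builds the whole tree in memory
# (use location => $path instead of string => ... for a file).
my $doc = XML::LibXML->load_xml(string => $xml);

my @lines;
for my $order ($doc->findnodes('/shop/orders/order')) {
    my $id = $order->getAttribute('customer');
    # Look up a different part of the document; easy with a tree,
    # awkward with a stream of SAX events.
    my $name = $doc->findvalue(qq{/shop/customers/customer[\@id="$id"]/name});
    push @lines, "$name: " . $order->getAttribute('total');
}
print "$_\n" for @lines;
```

With a SAX parser you would have to remember every customer yourself before the orders arrive, which is the "inside-out" bookkeeping described above.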

I recommend trying a SAX parser at least once, because the event-driven thinking required to do so is good exercise.

I've had good success using XML::SAX::Machines to set up SAX parsing in Perl; if you want multiple filters and pipelines, it's easy to set up. For simpler setups (i.e., 99% of the time) you just need a single SAX filter (look at XML::Filter::Base) and tell XML::SAX::Machines to parse the file (or read from a filehandle) using your filter. Here's a thorough article.
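
To give a feel for the event-driven style, here is a minimal sketch of a single handler. It uses plain XML::SAX with a handler derived from XML::SAX::Base rather than XML::SAX::Machines, and the `TitleCollector` package and the feed contents are invented for the example.

```perl
use strict;
use warnings;

# Hypothetical handler: collects the text of every <title> element.
package TitleCollector;
use base 'XML::SAX::Base';

sub start_element {
    my ($self, $el) = @_;
    # Start buffering text when a <title> opens.
    $self->{buf} = '' if $el->{LocalName} eq 'title';
}

sub characters {
    my ($self, $chars) = @_;
    # Text may arrive in several chunks, so append rather than assign.
    $self->{buf} .= $chars->{Data} if defined $self->{buf};
}

sub end_element {
    my ($self, $el) = @_;
    return unless $el->{LocalName} eq 'title';
    # Real code would usually process the title here instead of storing it.
    push @{ $self->{titles} }, delete $self->{buf};
}

package main;
use XML::SAX::ParserFactory;

my $xml = <<'XML';
<feed>
  <item><title>First</title><body>...</body></item>
  <item><title>Second</title><body>...</body></item>
</feed>
XML

my $handler = TitleCollector->new;
my $parser  = XML::SAX::ParserFactory->parser(Handler => $handler);
$parser->parse_string($xml);    # parse_uri($path) reads from a file instead

print "$_\n" for @{ $handler->{titles} };
```

The handler only ever holds the text of the element it is currently inside, which is why memory use stays flat regardless of file size.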

#2


14  

For large XML files, you can use XML::LibXML in DOM mode if the document fits in memory. If it does not, use XML::LibXML's pull mode (see XML::LibXML::Reader) or XML::Twig (which I wrote, so I'm biased, but it generally works well for files that are too big to fit in memory).
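
A minimal sketch of the pull mode, assuming a made-up document of repeated `hit` records (these are illustrative names, not the real BLAST element names): the reader walks forward through the input and only one record at a time is expanded into a small DOM fragment.

```perl
use strict;
use warnings;
use XML::LibXML::Reader;

# Hypothetical sample data standing in for a much larger file.
my $xml = <<'XML';
<hits>
  <hit><id>h1</id><score>42</score></hit>
  <hit><id>h2</id><score>17</score></hit>
</hits>
XML

# Use location => $path instead of string => ... to stream from a file.
my $reader = XML::LibXML::Reader->new(string => $xml)
    or die "cannot create reader";

my @seen;
while ($reader->nextElement('hit')) {
    # Copy just this record (deep copy) so it can be queried like a DOM node.
    my $hit = $reader->copyCurrentNode(1);
    push @seen, $hit->findvalue('id') . '=' . $hit->findvalue('score');
}
print "$_\n" for @seen;
```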

I am not a fan of SAX, which is hard to use and in fact quite slow.
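
For comparison, a minimal XML::Twig sketch under the same assumptions (a made-up document of repeated `hit` records): a handler fires once each record has been fully parsed, and `purge` then discards what has been read so far.

```perl
use strict;
use warnings;
use XML::Twig;

# Hypothetical sample data standing in for a much larger file.
my $xml = <<'XML';
<hits>
  <hit><id>h1</id><score>42</score></hit>
  <hit><id>h2</id><score>17</score></hit>
</hits>
XML

my @seen;
my $twig = XML::Twig->new(
    twig_handlers => {
        # Called each time a complete <hit> element has been parsed.
        hit => sub {
            my ($t, $hit) = @_;
            push @seen,
                $hit->first_child_text('id') . '=' . $hit->first_child_text('score');
            $t->purge;    # release the memory used by everything parsed so far
        },
    },
);
$twig->parse($xml);    # parsefile($path) handles a file on disk

print "$_\n" for @seen;
```

Inside the handler each record is an ordinary small tree, so you get DOM-style access without holding the whole document.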

#3


4  

I have not used the XML::Simple module before, but from the documentation it appears to create a simple hash in memory. This is not a full DOM tree, but may well be enough for your requirements.

For large XML files, using a SAX parser would be faster and have a smaller memory footprint, but then it would again depend upon your needs. If you just need to process the data in a serial fashion, then using XML::SAX would probably suit your needs. If you need to manipulate your whole tree, then maybe using something like XML::LibXML would be better for you.

It is all horses for courses, I'm afraid.
