How should I parse large XML files in Perl?

Does reading XML data like in the following code create the DOM tree in memory?

use XML::Simple;

my $xml  = XML::Simple->new;
my $data = $xml->XMLin($blast_output, ForceArray => 1);

For large XML files should I use a SAX parser, with handlers, etc.?

3 Solutions

#1

I would say yes to both. XML::Simple builds the entire tree in memory, and that tree takes up a large multiple of the file's size. For many applications, once your XML is over 100MB or so, it becomes practically impossible to load it entirely into memory in Perl. A SAX parser instead delivers "events", or notifications, as the file is read and tags are opened or closed.
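
To make the "events" idea concrete, here is a minimal sketch of a SAX handler, assuming XML::SAX is installed. The HitCounter package and the Hit element name are made up for illustration and are not from the question's data.

```perl
use strict;
use warnings;
use XML::SAX::ParserFactory;

{
    # Hypothetical handler: counts <Hit> start tags as they stream past.
    package HitCounter;
    use base 'XML::SAX::Base';

    sub start_document {
        my ($self) = @_;
        $self->{hits} = 0;
    }

    sub start_element {
        my ($self, $el) = @_;
        # $el->{Name} is the element name as written in the document.
        $self->{hits}++ if $el->{Name} eq 'Hit';
    }
}

my $handler = HitCounter->new;
my $parser  = XML::SAX::ParserFactory->parser(Handler => $handler);

# A tiny inline document; for a real file, parse_uri($path) streams from disk.
$parser->parse_string('<Hits><Hit/><Hit/></Hits>');

print "hits: $handler->{hits}\n";
```

Only the counter lives in memory here, not the document, which is why this style scales to large files.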

Depending on your usage patterns, either a SAX or a DOM-based parser could be faster. If you are handling just a few nodes, or every node once, in a large file, SAX is probably best: for example, reading a large RSS feed and parsing every item in it.

On the other hand, if you need to cross-reference one part of the file with another part, a DOM parser or accessing via XPath will make more sense - writing it in the "inside-out" manner that a SAX parser requires will be clumsy and tricky.
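
As an illustration of the cross-referencing case, here is a small sketch using XML::LibXML in DOM mode with XPath. The document structure (author and book elements linked by an id) is invented for the example.

```perl
use strict;
use warnings;
use XML::LibXML;

# Made-up data: each <book> refers to an <author> by id.
my $xml = '<db><author id="a1" name="Ann"/><book author="a1" title="Dune"/></db>';
my $doc = XML::LibXML->load_xml(string => $xml);

my @lines;
for my $book ($doc->findnodes('/db/book')) {
    my $id = $book->getAttribute('author');
    # Look up the matching author anywhere in the tree; this is the step
    # that is awkward to express with SAX events.
    my ($author) = $doc->findnodes(qq{/db/author[\@id="$id"]});
    push @lines, sprintf '%s by %s',
        $book->getAttribute('title'), $author->getAttribute('name');
}

print "$_\n" for @lines;
```

For a file on disk, load_xml(location => $path) does the same job, at the cost of holding the whole tree in memory.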

I recommend trying a SAX parser at least once, because the event-driven thinking required to do so is good exercise.

I've had good success using XML::SAX::Machines to set up SAX parsing in Perl. If you want multiple filters and pipelines, they are easy to set up. For simpler setups (i.e. 99% of the time) you just need a single SAX filter (look at XML::Filter::Base), and you tell XML::SAX::Machines to parse the file (or read from a filehandle) using your filter. Here's a thorough article.

#2

For large XML files, you can use XML::LibXML, either in DOM mode if the document fits in memory or in pull mode (see XML::LibXML::Reader) if it doesn't. Alternatively, use XML::Twig (which I wrote, so I'm biased, but it generally works well for files that are too big to fit in memory).
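
A rough sketch of the XML::Twig approach, assuming each record is a Hit element with a Hit_id child (both names are placeholders for whatever your file actually contains):

```perl
use strict;
use warnings;
use XML::Twig;

my @ids;
my $twig = XML::Twig->new(
    twig_handlers => {
        # Called each time a complete <Hit> element has been parsed.
        Hit => sub {
            my ($t, $hit) = @_;
            push @ids, $hit->first_child_text('Hit_id');
            $t->purge;    # release everything parsed so far
        },
    },
);

# Inline sample; for a large file use $twig->parsefile($path) instead.
$twig->parse('<Hits><Hit><Hit_id>a</Hit_id></Hit><Hit><Hit_id>b</Hit_id></Hit></Hits>');

print join(',', @ids), "\n";
```

Because purge is called after every record, memory use stays roughly at the size of one Hit rather than the whole file.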

I am not a fan of SAX, which is hard to use and in fact quite slow.

#3

I have not used the XML::Simple module before, but from the documentation it appears to create a simple hash in memory. This is not a full DOM tree, but may well be enough for your requirements.

For large XML files, using a SAX parser would be faster and have a smaller memory footprint, but then it would again depend upon your needs. If you just need to process the data in a serial fashion, then using XML::SAX would probably suit your needs. If you need to manipulate your whole tree, then maybe using something like XML::LibXML would be better for you.

It's all horses for courses, I'm afraid.
