在python中解析大型伪xml文件

I'm trying to parse* a large file (> 5GB) of structured markup data. The data format is essentially XML but there is no explicit root element. What's the most efficient way to do that?

我正在尝试解析*大型文件（> 5GB）的结构化标记数据。数据格式本质上是XML，但没有明确的根元素。最有效的方法是什么？

The problem with SAX parsers is that they require a root element, so either I've to add a pseudo element to the data stream (is there an equivalent to Java's SequenceInputStream in Python?) or I've to switch to a non-SAX conform event-based parser (is there a successor of sgmllib?)

SAX解析器的问题是它们需要一个根元素，所以要么我要在数据流中添加一个伪元素（在Python中是否相当于Java的SequenceInputStream？）或者我要切换到非SAX符合基于事件的解析器（是否有sgmllib的后继？）

The structure of the data is quite simple. Basically a listing of elements:

数据的结构非常简单。基本上是元素列表：

<Document>
  <docid>1</docid>
  <text>foo</text>
</Document>
<Document>
  <docid>2</docid>
  <text>bar</text>
</Document>

*actually to iterate

*实际上是迭代

4 个解决方案

#1

http://docs.python.org/library/xml.sax.html

Note, that you can pass a 'stream' object to xml.sax.parse. This means you can probably pass any object that has file-like methods (like read) to the parse call... Make your own object, which will firstly put your virtual root start-tag, then the contents of file, then virtual root end-tag. I guess that you only need to implement read method... but this might depend on the sax parser you'll use.

请注意，您可以将“stream”对象传递给xml.sax.parse。这意味着您可以将任何具有类文件方法（如read）的对象传递给解析调用...创建自己的对象，首先将您的虚拟根开始标记，然后是文件内容，然后是虚拟根目录结束标签。我想你只需要实现read方法......但这可能取决于你将使用的sax解析器。

Example that works for me:

适用于我的示例：

import xml.sax
import xml.sax.handler

class PseudoStream(object):
    def read_iterator(self):
        yield '<foo>'
        yield '<bar>'
        for line in open('test.xml'):
            yield line
        yield '</bar>'
        yield '</foo>'

    def __init__(self):
        self.ri = self.read_iterator()

    def read(self, *foo):
        try:
            return self.ri.next()
        except StopIteration:
            return ''

class SAXHandler(xml.sax.handler.ContentHandler):
    def startElement(self, name, attrs):
        print name, attrs

d = xml.sax.parse(PseudoStream(), SAXHandler())

#2

The quick and dirty answer would be adding a root element (as String) so it would be a valid XML.

快速而肮脏的答案是添加一个根元素（作为String），因此它将是一个有效的XML。

Regards.

问候。

#3

Add root element and use SAX, STax or VTD-XML ..

添加根元素并使用SAX，STax或VTD-XML ..

#4

xml.parsers.expat -- Fast XML parsing using Expat The xml.parsers.expat module is a Python interface to the Expat non-validating XML parser. The module provides a single extension type, xmlparser, that represents the current state of an XML parser. After an xmlparser object has been created, various attributes of the object can be set to handler functions. When an XML document is then fed to the parser, the handler functions are called for the character data and markup in the XML document.

xml.parsers.expat - 使用Expat进行快速XML解析xml.parsers.expat模块是Expat非验证XML解析器的Python接口。该模块提供单个扩展类型xmlparser，它表示XML解析器的当前状态。创建xmlparser对象后，可以将对象的各种属性设置为处理函数。然后，当XML文档被提供给解析器时，将为XML文档中的字符数据和标记调用处理函数。

More info : http://www.python.org/doc/2.5/lib/module-xml.parsers.expat.html

更多信息：http：//www.python.org/doc/2.5/lib/module-xml.parsers.expat.html

#1