Parsing a large (1Gb+) XML file with lxml and iterparse()

Time: 2022-04-23 17:00:46

I have to parse a 1Gb XML file with a structure such as below and extract the text within the tags "Author" and "Content":

<Database>
    <BlogPost>
        <Date>MM/DD/YY</Date>
        <Author>Last Name, Name</Author>
        <Content>Lorem ipsum dolor sit amet, consectetur adipiscing elit. Maecenas dictum dictum vehicula.</Content>
    </BlogPost>

    <BlogPost>
        <Date>MM/DD/YY</Date>
        <Author>Last Name, Name</Author>
        <Content>Lorem ipsum dolor sit amet, consectetur adipiscing elit. Maecenas dictum dictum vehicula.</Content>
    </BlogPost>

    [...]

    <BlogPost>
        <Date>MM/DD/YY</Date>
        <Author>Last Name, Name</Author>
        <Content>Lorem ipsum dolor sit amet, consectetur adipiscing elit. Maecenas dictum dictum vehicula.</Content>
    </BlogPost>
</Database>

So far I've tried two things: (i) reading the whole file and searching it with .find(xmltag), and (ii) parsing the XML file with lxml and iterparse(). I got the first option to work, but it is very slow. The second option I haven't managed to get off the ground.

Here's part of what I have:

for event, element in etree.iterparse(path_to_file, tag="BlogPost"):
    if element.tag == "BlogPost":
        print(element.text)
    else:
        print('Finished')

The result of that is only blank spaces, with no text in them.

I must be doing something wrong, but I can't figure out what. Also, in case it wasn't obvious, I am quite new to Python and this is the first time I'm using lxml. Please help!

3 Answers

#1


20  

for event, element in etree.iterparse(path_to_file, tag="BlogPost"):
    for child in element:
        print(child.tag, child.text)
    element.clear()

The final clear() will stop you from using too much memory.

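A fuller version of this cleanup, along the lines of the pattern in the lxml FAQ, also deletes siblings that were already processed, so the empty elements left behind by clear() don't pile up under the root. The `fast_iter` name is just illustrative:

```python
from lxml import etree

def fast_iter(path_to_file, tag):
    """Yield each matching element, then free it and its processed siblings."""
    for event, element in etree.iterparse(path_to_file, tag=tag):
        yield element
        element.clear()
        # drop references to earlier siblings so the empty elements
        # left behind by clear() can be garbage-collected
        while element.getprevious() is not None:
            del element.getparent()[0]
```

Consume it like any generator, e.g. `for post in fast_iter(path_to_file, "BlogPost"): ...`.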
[Update:] To get "everything between ... as a string", I guess you want one of:

for event, element in etree.iterparse(path_to_file, tag="BlogPost"):
    print(etree.tostring(element, encoding='unicode'))
    element.clear()

or

for event, element in etree.iterparse(path_to_file, tag="BlogPost"):
    print(''.join(etree.tostring(child, encoding='unicode') for child in element))
    element.clear()

or perhaps even:

for event, element in etree.iterparse(path_to_file, tag="BlogPost"):
    print(''.join(child.text for child in element))
    element.clear()

#2


8  

For future searchers: The top answer here suggests clearing the element on each iteration, but that still leaves you with an ever-increasing set of empty elements that will slowly build up in memory:

for event, element in etree.iterparse(path_to_file, tag="BlogPost"):
    for child in element:
        print(child.tag, child.text)
    element.clear()

^ This is not a scalable solution, especially as your source file gets larger and larger. The better solution is to get the root element, and clear that every time you load a complete record. This will keep memory usage pretty stable (sub-20MB I would say).

Here's a solution that doesn't require looking for a specific tag. This function returns a generator that yields all first-level child nodes (e.g. <BlogPost> elements) underneath the root node (e.g. <Database>). It does this by recording the start of the first tag after the root node, waiting for the corresponding end tag, yielding the entire element, and then clearing the root node.

from lxml import etree

xmlfile = '/path/to/xml/file.xml'

def iterate_xml(xmlfile):
    doc = etree.iterparse(xmlfile, events=('start', 'end'))
    _, root = next(doc)  # the first event is the start of the root element
    start_tag = None
    for event, element in doc:
        if event == 'start' and start_tag is None:
            start_tag = element.tag  # remember which record tag we're inside
        if event == 'end' and element.tag == start_tag:
            yield element
            start_tag = None
            root.clear()  # free the finished record from the tree
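As a quick end-to-end check of how the generator is used (the function is repeated so the snippet runs standalone; the sample data is illustrative, and iterparse() happily accepts a file-like object in place of a filename):

```python
import io
from lxml import etree

def iterate_xml(xmlfile):
    """Yield each first-level child of the root, clearing the root as we go."""
    doc = etree.iterparse(xmlfile, events=('start', 'end'))
    _, root = next(doc)
    start_tag = None
    for event, element in doc:
        if event == 'start' and start_tag is None:
            start_tag = element.tag
        if event == 'end' and element.tag == start_tag:
            yield element
            start_tag = None
            root.clear()

# a BytesIO stands in for the 1Gb file on disk
sample = io.BytesIO(
    b"<Database>"
    b"<BlogPost><Author>Doe, Jane</Author><Content>Hello</Content></BlogPost>"
    b"<BlogPost><Author>Roe, John</Author><Content>World</Content></BlogPost>"
    b"</Database>"
)
records = [(p.findtext('Author'), p.findtext('Content')) for p in iterate_xml(sample)]
print(records)  # [('Doe, Jane', 'Hello'), ('Roe, John', 'World')]
```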

#3


4  

I prefer XPath for such things:

In [1]: from lxml.etree import parse

In [2]: tree = parse('/tmp/database.xml')

In [3]: for post in tree.xpath('/Database/BlogPost'):
   ...:     print('Author:', post.xpath('Author')[0].text)
   ...:     print('Content:', post.xpath('Content')[0].text)
   ...: 
Author: Last Name, Name
Content: Lorem ipsum dolor sit amet, consectetur adipiscing elit. Maecenas dictum dictum vehicula.
Author: Last Name, Name
Content: Lorem ipsum dolor sit amet, consectetur adipiscing elit. Maecenas dictum dictum vehicula.
Author: Last Name, Name
Content: Lorem ipsum dolor sit amet, consectetur adipiscing elit. Maecenas dictum dictum vehicula.

I'm not sure if it's different in terms of processing big files, though. Comments about this would be appreciated.

Doing it your way,

for event, element in etree.iterparse(path_to_file, tag="BlogPost"):
    for info in element.iter():
        if info.tag in ('Author', 'Content'):
            print(info.tag, ':', info.text)
