How to parse some data from a large XML file?

Time: 2021-12-23 05:58:48

I need to extract the location and radius data from a large xml file that is formatted as below and store the data in 2-dimensional ndarray. This is my first time using Python and I can't find anything about the best way to do this.

<species name="MyHeterotrophEPS" header="family,genealogy,generation,birthday,biomass,inert,capsule,growthRate,volumeRate,locationX,locationY,locationZ,radius,totalRadius">
0,0,0,0.0,0.0,0.0,77.0645361927206,-0.1001871531330136,-0.0013358287084401814,4.523853439106942,234.14575280979898,123.92820420047076,0.0,0.6259920275663835;
0,0,0,0.0,0.0,0.0,108.5705297969604,-0.1411462759900182,-0.001881950346533576,1.0429122163754276,144.1066875513379,72.24884428367467,0.0,0.7017581019907897;
.
.
.
</species>

Edit: I mean "large" by human standards. I am not having any memory issues with it.

3 Solutions

#1


4  

You essentially have CSV data in the XML text value.

Use ElementTree to parse the XML, then use numpy.genfromtxt() to load that text into an array:

import numpy
from xml.etree import ElementTree as ET

tree = ET.parse('yourxmlfilename.xml')
# locate the <species> element by its name attribute
species = tree.find(".//species[@name='MyHeterotrophEPS']")
names = species.attrib['header']
# strip the trailing ';' from each line, then let genfromtxt parse the CSV text
array = numpy.genfromtxt((line.rstrip(';') for line in species.text.splitlines()),
                         delimiter=',', names=names)

Note the generator expression, with a str.splitlines() call; this turns the text of the XML element into a sequence of lines, which .genfromtxt() is quite happy to receive. We do remove the trailing ; character from each line.

For your sample input (minus the . lines), this results in:

array([ (0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 77.0645361927206, -0.1001871531330136, -0.0013358287084401814, 4.523853439106942, 234.14575280979898, 123.92820420047076, 0.0, 0.6259920275663835),
       (0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 108.5705297969604, -0.1411462759900182, -0.001881950346533576, 1.0429122163754276, 144.1066875513379, 72.24884428367467, 0.0, 0.7017581019907897)], 
      dtype=[('family', '<f8'), ('genealogy', '<f8'), ('generation', '<f8'), ('birthday', '<f8'), ('biomass', '<f8'), ('inert', '<f8'), ('capsule', '<f8'), ('growthRate', '<f8'), ('volumeRate', '<f8'), ('locationX', '<f8'), ('locationY', '<f8'), ('locationZ', '<f8'), ('radius', '<f8'), ('totalRadius', '<f8')])
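If you specifically want the location and radius columns in a plain 2-D ndarray, as asked in the question, you can stack the named fields of the structured array. A minimal sketch continuing from the snippet above (the coords variable name is just illustrative):

# build an (n, 4) float array of locationX, locationY, locationZ and radius
coords = numpy.column_stack([array['locationX'], array['locationY'],
                             array['locationZ'], array['radius']])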

#2


2  

If your XML is just that species node, it's pretty simple, and Martijn Pieters has already explained it better than I can.

But if you've got a ton of species nodes in the document, and it's too large to fit the whole thing into memory, you can use iterparse instead of parse:

import numpy as np
import xml.etree.ElementTree as ET

# iterparse yields each element as soon as it has been fully parsed,
# so the whole document never needs to be in memory at once
for event, node in ET.iterparse('species.xml'):
    if node.tag == 'species':
        name = node.attrib['name']
        names = node.attrib['header']
        csvdata = (line.rstrip(';') for line in node.text.splitlines())
        array = np.genfromtxt(csvdata, delimiter=',', names=names)
        # do something with the array.

This won't help if you just have one super-gigantic species node, because even iterparse (or similar solutions like a SAX parser) parses one entire node at a time. You'd need to find an XML library that lets you stream the text of large nodes, and off the top of my head, I can't think of any stdlib or popular third-party parsers that can do that.

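If memory does become a concern with many species nodes, a common ElementTree pattern (my addition, not part of the answer above) is to clear each element once you are finished with it, so the parts of the tree you have already processed can be freed:

for event, node in ET.iterparse('species.xml'):
    if node.tag == 'species':
        names = node.attrib['header']
        csvdata = (line.rstrip(';') for line in node.text.splitlines())
        array = np.genfromtxt(csvdata, delimiter=',', names=names)
        # ... use the array ...
        node.clear()  # drop the element's text and children to keep memory low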

#3


0  

If the file is really large, use ElementTree or SAX.

If the file is not that large (i.e. fits into memory), minidom might be easier to work with.

Each line seems to be a simple string of comma-separated numbers, so you can simply do line.split(',').

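A rough sketch of that approach (assuming the file fits in memory; the placeholder file name follows the first answer, everything else is illustrative):

from xml.dom import minidom
import numpy as np

doc = minidom.parse('yourxmlfilename.xml')
species = doc.getElementsByTagName('species')[0]
# join the text node(s) inside <species>; minidom may split long text into several nodes
text = ''.join(child.data for child in species.childNodes
               if child.nodeType == child.TEXT_NODE)
# one row per record: strip the trailing ';' and split on commas
rows = [line.rstrip(';').split(',') for line in text.splitlines() if line.strip()]
data = np.array(rows, dtype=float)   # 2-D ndarray of floats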
