How do I parse a string representing an xml.dom.minidom node in Python?

Time: 2022-10-26 20:56:52

I have a collection of xml.dom.Node objects created using xml.dom.minidom. I store them (individually) in a database by converting each one to a string using the Node object's toxml() method.

The problem is that I'd sometimes like to be able to convert them back to the appropriate Node object using a parser of some kind. As far as I can see, the various libraries shipped with Python use Expat, which won't parse a string like '' or indeed anything that is not a well-formed XML document.

So, does anyone have any ideas? I realise I could pickle the nodes in some way and then unpickle them, but that feels unpleasant and I'd much rather be storing in a form I can read for maintenance purposes. Surely there is something that will do this?

In response to the doubt expressed that this is possible, an example of what I mean:

>>> import xml.dom.minidom
>>> x=xml.dom.minidom.parseString('<a>foo<b>thing</b></a>')
>>> x.documentElement.childNodes[0]
<DOM Text node "u'foo'">
>>> x.documentElement.childNodes[0].toxml()
u'foo'
>>> xml.dom.minidom.parseString(x.documentElement.childNodes[0].toxml())
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/xml/dom/minidom.py", line 1928, in parseString
    return expatbuilder.parseString(string)
  File "/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/xml/dom/expatbuilder.py", line 940, in parseString
    return builder.parseString(string)
  File "/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/xml/dom/expatbuilder.py", line 223, in parseString
    parser.Parse(string, True)
xml.parsers.expat.ExpatError: syntax error: line 1, column 0

In other words, the .toxml() method does not always produce something that Expat (and hence parseString, out of the box) will parse.

What I would like is something that will parse u'foo' into a text node, i.e. something that will reverse the effect of .toxml().

2 answers

#1


2  

What types of node do you need to store?

Obviously Element nodes should just work if serialised with .toxml('utf-8'); the results should be parseable as an XML document as-is and the element retrievable from documentElement, as long as there are no EntityReferences inside it that would need definition in the doctype.
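
A minimal sketch of that element round trip, assuming Python 3 and reusing the document from the question. The variable names (stored, restored) are only illustrative:

```python
from xml.dom import minidom

doc = minidom.parseString('<a>foo<b>thing</b></a>')
element = doc.documentElement.childNodes[1]   # the <b> element

# Serialise: with an encoding argument, toxml() returns bytes.
stored = element.toxml('utf-8')

# Deserialise: the stored bytes are a complete document on their own,
# so the element comes back as documentElement.
restored = minidom.parseString(stored).documentElement
```

Note that restored belongs to a new Document, not to the one the original element came from.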

Text nodes, on the other hand, would need either HTML-decoding or some wrapping to parse. If you only needed elements and text nodes you could guess whether it was an element from the first character, since that must always be < for an element:

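from xml.dom import minidom

# Serialise: toxml('utf-8') returns a byte string.
xml = node.toxml('utf-8')

...

# Deserialise: an element always starts with '<'; anything else is treated as text.
if xml.startswith(b'<'):
    node = minidom.parseString(xml).documentElement
else:
    # Wrap bare text in a throwaway element so that Expat will accept it.
    node = minidom.parseString(b'<x>%s</x>' % xml).documentElement.firstChild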
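
The decoding route, as opposed to wrapping, might look something like the sketch below. It assumes the text was serialised with toxml() without an encoding argument, so no numeric character references are involved. The helper name text_node_from_xml is made up for illustration; unescape comes from the standard xml.sax.saxutils module.

```python
from xml.dom import minidom
from xml.sax.saxutils import unescape

def text_node_from_xml(xml, document=None):
    """Rebuild a Text node from the escaped string produced by Text.toxml()."""
    if document is None:
        document = minidom.Document()
    # unescape() handles &lt;, &gt; and &amp; by default; &quot; is added
    # because some minidom versions also escape double quotes in text.
    data = unescape(xml, {'&quot;': '"'})
    return document.createTextNode(data)

original = minidom.Document().createTextNode('a < b & "c"')
restored = text_node_from_xml(original.toxml())
```

Unlike the wrapping approach, this never calls the parser, so it also copes with an empty string.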

Comment nodes could similarly be stored by checking for <!--.
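
A rough sketch with the comment check added, assuming Python 3 and strings produced by toxml() without an encoding argument. The function name restore_node is hypothetical, and only Element, Text and Comment nodes are handled:

```python
from xml.dom import minidom

def restore_node(xml):
    """Rebuild an Element, Comment or Text node from its toxml() string."""
    if xml.startswith('<') and not xml.startswith('<!--'):
        # An element is a well-formed document by itself.
        return minidom.parseString(xml).documentElement
    # Comments and text are not documents on their own, so wrap them
    # in a throwaway element and take its first child.
    return minidom.parseString('<x>%s</x>' % xml).documentElement.firstChild

doc = minidom.parseString('<a>foo<!--note--><b>thing</b></a>')
restored = [restore_node(child.toxml()) for child in doc.documentElement.childNodes]
```

One caveat: an empty string leaves the wrapper with no children, so firstChild is None in that case.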

Other node types like Attr would be more work since their XML representation is not easily distinguishable from Text. You would probably need to store an out-of-band nodeType value to remember it. OTOH minidom doesn't implement toxml() on Attr anyway so maybe that's not an issue.
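
If you do go the out-of-band route, one possible shape is to store the nodeType next to the XML string and dispatch on it when loading. This is only a sketch: dump_node and load_node are invented names, and it covers just Element, Text and Comment nodes, rejecting everything else.

```python
from xml.dom import Node, minidom

# Node types that are not documents on their own and need a wrapper element.
WRAPPED_TYPES = (Node.TEXT_NODE, Node.COMMENT_NODE)

def dump_node(node):
    """Return the (nodeType, xml) pair to store, e.g. in two database columns."""
    return node.nodeType, node.toxml()

def load_node(node_type, xml):
    """Rebuild a node from a stored (nodeType, xml) pair."""
    if node_type == Node.ELEMENT_NODE:
        return minidom.parseString(xml).documentElement
    if node_type in WRAPPED_TYPES:
        return minidom.parseString('<x>%s</x>' % xml).documentElement.firstChild
    raise ValueError('unsupported nodeType: %r' % node_type)
```

The stored type removes the need to guess from the first character, at the cost of an extra column.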

#2


3  

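from xml.dom.minidom import parseString
from xml.parsers.expat import ExpatError

try:
    node = parseString('')
except ExpatError:
    # Not a well-formed XML document, so there is no node to return.
    node = None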
