XML parsing with Python and minidom

Time: 2021-02-05 00:13:27

I'm using Python (minidom) to parse an XML file and print a hierarchical structure that looks something like this (indentation is used here to show the significant hierarchical relationship):

My Document
Overview
    Basic Features
    About This Software
        Platforms Supported

Instead, the program iterates multiple times over the nodes and produces the following, printing duplicate nodes. (Looking at the node list at each iteration, it's obvious why it does this but I can't seem to find a way to get the node list I'm looking for.)

My Document
Overview
Basic Features
About This Software
Platforms Supported
Basic Features
About This Software
Platforms Supported
Platforms Supported

Here is the XML source file:

<?xml version="1.0" encoding="UTF-8"?>
<DOCMAP>
    <Topic Target="ALL">
        <Title>My Document</Title>
    </Topic>
    <Topic Target="ALL">
        <Title>Overview</Title>
        <Topic Target="ALL">
            <Title>Basic Features</Title>
        </Topic>
        <Topic Target="ALL">
            <Title>About This Software</Title>
            <Topic Target="ALL">
                <Title>Platforms Supported</Title>
            </Topic>
        </Topic>
    </Topic>
</DOCMAP>

Here is the Python program:

import xml.dom.minidom
from xml.dom.minidom import Node

dom = xml.dom.minidom.parse("test.xml")
Topic=dom.getElementsByTagName('Topic')
i = 0
for node in Topic:
    alist=node.getElementsByTagName('Title')
    for a in alist:
        Title= a.firstChild.data
        print Title

I could fix the problem by not nesting 'Topic' elements, by changing the lower level topic names to something like 'SubTopic1' and 'SubTopic2'. But, I want to take advantage of built-in XML hierarchical structuring without needing different element names; it seems that I should be able to nest 'Topic' elements and that there should be some way to know which level 'Topic' I'm currently looking at.

I've tried a number of different XPath functions without much success.

5 Answers

#1


8  

getElementsByTagName is recursive: you get all descendants with a matching tagName. Because your Topics contain other Topics that also have Titles, the call returns the lower-level Titles many times.

If you want only the matching direct children, and you don't have XPath available, you can write a simple filter, e.g.:

def getChildrenByTagName(node, tagName):
    for child in node.childNodes:
        if child.nodeType==child.ELEMENT_NODE and (tagName=='*' or child.tagName==tagName):
            yield child

# dom is the document parsed in the question's program
for topic in dom.getElementsByTagName('Topic'):
    title = next(getChildrenByTagName(topic, 'Title'))  # first direct Title child
    print(title.firstChild.data)
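
The filter above lists each title once but does not indent it. As a minimal sketch (written for Python 3, and not part of the original answer), the same direct-children filter can be applied recursively so that the recursion depth gives the indentation. The `outline` helper and the inlined `XML` string are illustrative assumptions; the XML is the question's sample with the declaration line omitted.

```python
import xml.dom.minidom

# Sample data from the question, inlined so the sketch is self-contained.
XML = """<DOCMAP>
    <Topic Target="ALL">
        <Title>My Document</Title>
    </Topic>
    <Topic Target="ALL">
        <Title>Overview</Title>
        <Topic Target="ALL">
            <Title>Basic Features</Title>
        </Topic>
        <Topic Target="ALL">
            <Title>About This Software</Title>
            <Topic Target="ALL">
                <Title>Platforms Supported</Title>
            </Topic>
        </Topic>
    </Topic>
</DOCMAP>"""


def get_children_by_tag_name(node, tag_name):
    """Yield only the direct element children of node named tag_name."""
    for child in node.childNodes:
        if child.nodeType == child.ELEMENT_NODE and child.tagName == tag_name:
            yield child


def outline(node, depth=0):
    """Hypothetical helper: indented title lines for the Topics under node."""
    lines = []
    for topic in get_children_by_tag_name(node, "Topic"):
        for title in get_children_by_tag_name(topic, "Title"):
            lines.append("    " * depth + title.firstChild.data)
        # Nested Topics are handled one level deeper.
        lines.extend(outline(topic, depth + 1))
    return lines


dom = xml.dom.minidom.parseString(XML)
result = outline(dom.documentElement)
print("\n".join(result))
```

Because each Topic is visited only through its parent, no title is reached twice, and the depth argument answers the question of which level of Topic is currently being processed.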

#2


7  

Let me put that comment here ...

Thanks for the attempt. It didn't work but it gave me some ideas. The following works (the same general idea; FWIW, the nodeType is ELEMENT_NODE):

import xml.dom.minidom
from xml.dom.minidom import Node

dom = xml.dom.minidom.parse("docmap.xml")

def getChildrenByTitle(node):
    for child in node.childNodes:
        if child.localName=='Title':
            yield child

Topic=dom.getElementsByTagName('Topic')
for node in Topic:
    alist=getChildrenByTitle(node)
    for a in alist:
#        Title= a.firstChild.data
        Title= a.childNodes[0].nodeValue
        print Title

#3


3  

You could use the following generator to walk the tree and get each title together with its nesting level:

def f(elem, level=-1):
    if elem.nodeName == "Title":
        yield elem.childNodes[0].nodeValue, level
    elif elem.nodeType == elem.ELEMENT_NODE:
        for child in elem.childNodes:
            for e, l in f(child, level + 1):
                yield e, l

If you test it with your file:

import xml.dom.minidom as minidom
doc = minidom.parse("test.xml")
list(f(doc.documentElement))  # start at the root element; the Document node itself is not an ELEMENT_NODE

you will get a list with the following tuples:

(u'My Document', 1), 
(u'Overview', 1), 
(u'Basic Features', 2), 
(u'About This Software', 2), 
(u'Platforms Supported', 3)

This is only a basic idea, to be fine-tuned of course. If you just want leading spaces, you can build them directly in the generator, though returning the level gives you more flexibility. You could also detect the first level automatically (initializing the level to -1 here is just a crude workaround).
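
One possible way to do that fine-tuning, sketched for Python 3: collect the (title, level) pairs, take the smallest level as the base, and indent relative to it, so the starting value of `level` no longer matters. The `indent_titles` helper and the inlined `XML` string are assumptions for illustration, and the walk is started at `doc.documentElement`, which is assumed to be the intended entry point.

```python
import xml.dom.minidom

# The question's sample data, inlined (XML declaration omitted).
XML = """<DOCMAP>
    <Topic Target="ALL"><Title>My Document</Title></Topic>
    <Topic Target="ALL">
        <Title>Overview</Title>
        <Topic Target="ALL"><Title>Basic Features</Title></Topic>
        <Topic Target="ALL">
            <Title>About This Software</Title>
            <Topic Target="ALL"><Title>Platforms Supported</Title></Topic>
        </Topic>
    </Topic>
</DOCMAP>"""


def f(elem, level=-1):
    """Generator from the answer: yield (title text, level) pairs."""
    if elem.nodeName == "Title":
        yield elem.childNodes[0].nodeValue, level
    elif elem.nodeType == elem.ELEMENT_NODE:
        for child in elem.childNodes:
            for pair in f(child, level + 1):
                yield pair


def indent_titles(pairs, indent="    "):
    """Hypothetical helper: indent titles relative to the shallowest level."""
    pairs = list(pairs)
    if not pairs:
        return []
    base = min(level for _, level in pairs)  # detected first level
    return [indent * (level - base) + title for title, level in pairs]


doc = xml.dom.minidom.parseString(XML)
lines = indent_titles(f(doc.documentElement))
print("\n".join(lines))
```

Keeping the level in the tuples and formatting in a separate step preserves the flexibility mentioned above: the same pairs could just as well drive numbering or HTML nesting.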

#4


2  

I think this can help:

import xml.dom.minidom

# Parse the file and print every Title once, in document order.
# Note: this gives a flat list; it does not preserve the nesting depth.
doc = xml.dom.minidom.parse("file.xml")
for title in doc.getElementsByTagName('Title'):
    print(title.firstChild.nodeValue)

Output:

My Document
Overview
Basic Features
About This Software
Platforms Supported
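
The output above is flat. If the indentation from the question is wanted, one option that keeps the single getElementsByTagName('Title') call is to count the Topic ancestors of each Title through parentNode. This is a rough Python 3 sketch, not the answerer's code; `topic_depth` and the inlined `XML` string are illustrative assumptions.

```python
import xml.dom.minidom

# The question's sample data, inlined (XML declaration omitted).
XML = """<DOCMAP>
    <Topic Target="ALL"><Title>My Document</Title></Topic>
    <Topic Target="ALL">
        <Title>Overview</Title>
        <Topic Target="ALL"><Title>Basic Features</Title></Topic>
        <Topic Target="ALL">
            <Title>About This Software</Title>
            <Topic Target="ALL"><Title>Platforms Supported</Title></Topic>
        </Topic>
    </Topic>
</DOCMAP>"""


def topic_depth(node):
    """Hypothetical helper: count the Topic elements among node's ancestors."""
    depth = 0
    parent = node.parentNode
    while parent is not None:
        if parent.nodeType == parent.ELEMENT_NODE and parent.tagName == "Topic":
            depth += 1
        parent = parent.parentNode
    return depth


doc = xml.dom.minidom.parseString(XML)
lines = []
for title in doc.getElementsByTagName("Title"):
    # A Title inside a top-level Topic has depth 1, so subtract 1 for the indent.
    indent = "    " * (topic_depth(title) - 1)
    lines.append(indent + title.firstChild.nodeValue)
print("\n".join(lines))
```

The ancestor walk costs a little extra per title, but it answers "which level am I at" for any node without restructuring the loop.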

#5


1  

Recursive function:

import xml.dom.minidom

def traverseTree(node, depth=0):
  # Print the text of every Title element, indented by its depth in the tree.
  for child in node.childNodes:
    if child.nodeType == child.TEXT_NODE:
      if node.tagName == 'Title':
        print(depth * '    ' + child.data)
    if child.nodeType == child.ELEMENT_NODE:
      traverseTree(child, depth + 1)

filename = 'sample.xml'
dom = xml.dom.minidom.parse(filename)
traverseTree(dom.documentElement)

Your XML:

<?xml version="1.0" encoding="UTF-8"?>
<DOCMAP>
    <Topic Target="ALL">
        <Title>My Document</Title>
    </Topic>
    <Topic Target="ALL">
        <Title>Overview</Title>
        <Topic Target="ALL">
            <Title>Basic Features</Title>
        </Topic>
        <Topic Target="ALL">
            <Title>About This Software</Title>
            <Topic Target="ALL">
                <Title>Platforms Supported</Title>
            </Topic>
        </Topic>
    </Topic>
</DOCMAP>

Your desired output:

 $ python parse_sample.py 
      My Document
      Overview
          Basic Features
          About This Software
              Platforms Supported
