Saving XML using ETree in Python. It's not retaining namespaces, and adding ns0, ns1 and removing xmlns tags

Time: 2023-02-09 21:13:20

I see there are similar questions here, but nothing that has fully helped me. I've also looked at the official documentation on namespaces but can't find anything that really helps; perhaps I'm just too new to XML formatting. I understand that I may need to create my own namespace dictionary? Either way, here is my situation:

I am getting a result from an API call; it gives me XML that is stored as a string in my Python application.

What I'm trying to accomplish is just to grab this XML, swap out a tiny value (the b:string value under ConditionValue/Default, but that's irrelevant to this question), and then save it as a string to send later in a REST POST call.

The source XML looks like this:

<Context xmlns="http://Test.the.Sdk/2010/07" xmlns:i="http://www.w3.org/2001/XMLSchema-instance">
<xmlns i:nil="true" xmlns="http://schema.test.org/2004/07/Test.Soa.Vocab" xmlns:a="http://schema.test.org/2004/07/System.Xml.Serialize"/>
<Conditions xmlns:a="http://schema.test.org/2004/07/Test.Soa.Vocab">
    <a:Condition>
        <a:xmlns i:nil="true" xmlns:b="http://schema.test.org/2004/07/System.Xml.Serialize"/>
        <Identifier>a23aacaf-9b6b-424f-92bb-5ab71505e3bc</Identifier>
        <Name>Code</Name>
        <ParameterSelections/>
        <ParameterSetCollections/>
        <Parameters/>
        <Summary i:nil="true"/>
        <Instance>25486d6c-36ba-4ab2-9fa6-0dbafbcf0389</Instance>
        <ConditionValue>
            <ComplexValue i:nil="true"/>
            <Text i:nil="true" xmlns:b="http://schemas.microsoft.com/2003/10/Serialization/Arrays"/>
            <Default>
                <ComplexValue i:nil="true"/>
                <Text xmlns:b="http://schemas.microsoft.com/2003/10/Serialization/Arrays">
                    <b:string>NULLCODE</b:string>
                </Text>
            </Default>
        </ConditionValue>
        <TypeCode>String</TypeCode>
    </a:Condition>
    <a:Condition>
        <a:xmlns i:nil="true" xmlns:b="http://schema.test.org/2004/07/System.Xml.Serialize"/>
        <Identifier>0af860f6-5611-4a23-96dc-eb3863975529</Identifier>
        <Name>Content Type</Name>
        <ParameterSelections/>
        <ParameterSetCollections/>
        <Parameters/>
        <Summary i:nil="true"/>
        <Instance>6364ec20-306a-4cab-aabc-8ec65c0903c9</Instance>
        <ConditionValue>
            <ComplexValue i:nil="true"/>
            <Text i:nil="true" xmlns:b="http://schemas.microsoft.com/2003/10/Serialization/Arrays"/>
            <Default>
                <ComplexValue i:nil="true"/>
                <Text xmlns:b="http://schemas.microsoft.com/2003/10/Serialization/Arrays">
                    <b:string>Standard</b:string>
                </Text>
            </Default>
        </ConditionValue>
        <TypeCode>String</TypeCode>
    </a:Condition>
</Conditions>
</Context>

My job is to swap out one of the values, retaining the entire structure of the source, and use this to submit a POST later on in the application.

The problem that I am having is that when it saves to a string or to a file, it totally messes up the namespaces:

<ns0:Context xmlns:ns0="http://Test.the.Sdk/2010/07" xmlns:ns1="http://schema.test.org/2004/07/Test.Soa.Vocab" xmlns:ns3="http://schemas.microsoft.com/2003/10/Serialization/Arrays" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<ns1:xmlns xsi:nil="true" />
<ns0:Conditions>
<ns1:Condition>
<ns1:xmlns xsi:nil="true" />
<ns0:Identifier>a23aacaf-9b6b-424f-92bb-5ab71505e3bc</ns0:Identifier>
<ns0:Name>Code</ns0:Name>
<ns0:ParameterSelections />
<ns0:ParameterSetCollections />
<ns0:Parameters />
<ns0:Summary xsi:nil="true" />
<ns0:Instance>25486d6c-36ba-4ab2-9fa6-0dbafbcf0389</ns0:Instance>
<ns0:ConditionValue>
<ns0:ComplexValue xsi:nil="true" />
<ns0:Text xsi:nil="true" />
<ns0:Default>
<ns0:ComplexValue xsi:nil="true" />
<ns0:Text>
<ns3:string>NULLCODE</ns3:string>
</ns0:Text>
</ns0:Default>
</ns0:ConditionValue>
<ns0:TypeCode>String</ns0:TypeCode>
</ns1:Condition>
<ns1:Condition>
<ns1:xmlns xsi:nil="true" />
<ns0:Identifier>0af860f6-5611-4a23-96dc-eb3863975529</ns0:Identifier>
<ns0:Name>Content Type</ns0:Name>
<ns0:ParameterSelections />
<ns0:ParameterSetCollections />
<ns0:Parameters />
<ns0:Summary xsi:nil="true" />
<ns0:Instance>6364ec20-306a-4cab-aabc-8ec65c0903c9</ns0:Instance>
<ns0:ConditionValue>
<ns0:ComplexValue xsi:nil="true" />
<ns0:Text xsi:nil="true" />
<ns0:Default>
<ns0:ComplexValue xsi:nil="true" />
<ns0:Text>
<ns3:string>Standard</ns3:string>
</ns0:Text>
</ns0:Default>
</ns0:ConditionValue>
<ns0:TypeCode>String</ns0:TypeCode>
</ns1:Condition>
</ns0:Conditions>
</ns0:Context>

I've narrowed the code down to its most basic form and I'm still getting the same results, so it has nothing to do with how I normally manipulate the file:

import xml.etree.ElementTree as ET
import requests

get_context_xml = 'http://localhost/testapi/returnxml'  # returns the first XML example above
source_context_xml = requests.get(get_context_xml)

# Parse the response body (bytes), not the Response object itself.
Tree = ET.fromstring(source_context_xml.content)

# Ensure the original namespaces are intact.
for Conditions in Tree.iter('{http://schema.test.org/2004/07/Test.Soa.Vocab}Condition'):
    print("success")

# ET.tostring() returns bytes by default, so open the file in binary mode.
with open('/home/memyself/output.xml', 'wb') as f:
    f.write(ET.tostring(Tree))

2 answers

#1


9  

You need to register the prefix and the namespace before you call fromstring() (reading the XML) to avoid the default namespace prefixes (ns0, ns1, etc.).

You can use the ET.register_namespace() function for that. Example:

ET.register_namespace('<prefix>','http://Test.the.Sdk/2010/07')
ET.register_namespace('a','http://schema.test.org/2004/07/Test.Soa.Vocab')

You can pass an empty string in place of <prefix> if you do not want a prefix; that namespace is then written as the default one.


Example/Demo:

>>> r = ET.fromstring('<a xmlns="blah">a</a>')
>>> ET.tostring(r)
b'<ns0:a xmlns:ns0="blah">a</ns0:a>'
>>> ET.register_namespace('','blah')
>>> r = ET.fromstring('<a xmlns="blah">a</a>')
>>> ET.tostring(r)
b'<a xmlns="blah">a</a>'
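
As a rough sketch of how this might apply to the XML from the question, the snippet below registers the four prefixes seen in the source, swaps the b:string value, and serializes the result. The document is trimmed to a single Condition, and the constant names and the replacement value NEWCODE are made up for illustration. In the standard library's xml.etree, the registered prefixes are looked up when the tree is serialized, and all xmlns declarations are written on the root element, so the output should be namespace-equivalent to the source rather than byte-identical.

```python
import xml.etree.ElementTree as ET

# Namespace URIs copied from the source XML in the question.
SDK_NS = "http://Test.the.Sdk/2010/07"
XSI_NS = "http://www.w3.org/2001/XMLSchema-instance"
VOCAB_NS = "http://schema.test.org/2004/07/Test.Soa.Vocab"
ARRAYS_NS = "http://schemas.microsoft.com/2003/10/Serialization/Arrays"

# Trimmed-down version of the source document (one Condition only).
SAMPLE = """<Context xmlns="http://Test.the.Sdk/2010/07" xmlns:i="http://www.w3.org/2001/XMLSchema-instance">
<Conditions xmlns:a="http://schema.test.org/2004/07/Test.Soa.Vocab">
<a:Condition>
<Name>Code</Name>
<Summary i:nil="true"/>
<ConditionValue><Default>
<Text xmlns:b="http://schemas.microsoft.com/2003/10/Serialization/Arrays"><b:string>NULLCODE</b:string></Text>
</Default></ConditionValue>
</a:Condition>
</Conditions>
</Context>"""

# Register the prefixes used by the source; "" marks the default namespace.
# Note that this registry is global to the process.
ET.register_namespace("", SDK_NS)
ET.register_namespace("i", XSI_NS)
ET.register_namespace("a", VOCAB_NS)
ET.register_namespace("b", ARRAYS_NS)

root = ET.fromstring(SAMPLE)

# Swap the value; NEWCODE is a placeholder, not a value from the question.
for node in root.iter("{%s}string" % ARRAYS_NS):
    if node.text == "NULLCODE":
        node.text = "NEWCODE"

# encoding="unicode" returns a str instead of bytes.
result = ET.tostring(root, encoding="unicode")
print(result)
```

The printed document keeps the a:, b: and i: prefixes and the unprefixed default namespace instead of ns0, ns1 and so on.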

#2


0  

First off, welcome to the * network! Technically, @anand-s-kumar is correct. However, there was a minor misuse of the tostring function, and there is also the fact that the namespaces might not always be known to the code, or be the same between tags or XML files. In addition, inconsistencies between the lxml and xml.etree libraries, and between Python 2.x and 3.x, make handling this difficult.

The following function iterates over all of the elements in the XML tree that is passed in and edits their tags to remove the namespaces. Note that by doing this, some data may be lost.

import re

def remove_namespaces(tree):
    # Strip the leading "{namespace}" part, if any, from every element tag.
    for el in tree.iter():
        match = re.match(r"^(?:\{.*?\})?(.*)$", el.tag)
        if match:
            el.tag = match.group(1)
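
To make the "some data may be lost" caveat concrete, here is a small usage sketch. It repeats the function so that it runs on its own, and it uses a two-element document that only borrows namespace URIs from the question; the variable names are illustrative.

```python
import re
import xml.etree.ElementTree as ET


def remove_namespaces(tree):
    # Strip the leading "{namespace}" part, if any, from every element tag.
    for el in tree.iter():
        match = re.match(r"^(?:\{.*?\})?(.*)$", el.tag)
        if match:
            el.tag = match.group(1)


doc = ET.fromstring(
    '<Context xmlns="http://Test.the.Sdk/2010/07" '
    'xmlns:a="http://schema.test.org/2004/07/Test.Soa.Vocab">'
    '<a:Condition><Name>Code</Name></a:Condition>'
    '</Context>'
)
remove_namespaces(doc)

stripped = ET.tostring(doc, encoding="unicode")
print(stripped)
# <Context><Condition><Name>Code</Name></Condition></Context>
```

The xmlns declarations and the a: prefix are gone from the output, so a service that expects the original namespaced document may not accept it. Attribute names are also left untouched by this function, so a namespaced attribute such as i:nil would still be serialized with a prefix.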

I ran into this problem myself and hacked together a quick solution. I tested it on about 81,000 XML files (averaging around 150 MB each) that had this problem, and all of them were fixed. Note that this isn't exactly an optimal solution, but it is relatively efficient and worked quite well for me.

CREDIT: Idea and code structure originally from Jochen Kupperschmidt.
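
For the case mentioned above where the namespaces are not known to the code in advance, one possible alternative to stripping them is to read the declarations from the document itself and register them before serializing. The sketch below is only an illustration: the helper name register_declared_namespaces and the sample document are made up, and it assumes the XML is available as a str that can be encoded as UTF-8.

```python
import io
import xml.etree.ElementTree as ET


def register_declared_namespaces(xml_text):
    """Register every prefix declared in xml_text and return {prefix: uri}."""
    declared = {}
    source = io.BytesIO(xml_text.encode("utf-8"))
    # "start-ns" events yield (prefix, uri); the default namespace has prefix "".
    for _event, (prefix, uri) in ET.iterparse(source, events=("start-ns",)):
        ET.register_namespace(prefix, uri)
        declared[prefix] = uri
    return declared


sample = '<a xmlns="blah"><b:c xmlns:b="urn:example"/></a>'
declared = register_declared_namespaces(sample)

output = ET.tostring(ET.fromstring(sample), encoding="unicode")
print(declared)
print(output)
```

One limitation to keep in mind: register_namespace keeps a single URI per prefix, so if a document binds the same prefix to different URIs in different places (as the source XML in the question does for a and b), only the last binding stays registered and the other URIs fall back to generated prefixes.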
