Reading UTF-8 XML and writing it to a file with Python

I'm trying to parse a UTF-8 XML file and save some parts of it to another file. The problem is that this is my first Python script ever, and I'm totally confused by the character encoding problems I'm running into.

My script fails as soon as it tries to write a non-ASCII character to a file, although it can print it to the command prompt (at least to some degree).

Here's the XML (at least the parts that matter; it's a *.resx file that contains UI strings):

<?xml version="1.0" encoding="utf-8"?>
<root>
     <resheader name="foo">
          <value>bar</value>
     </resheader>
     <data name="lorem" xml:space="preserve">
          <value>ipsum öä</value>
     </data>
</root>

And here's my Python script:

from xml.dom.minidom import parse

names = []
values = []

def getStrings(path):
    dom = parse(path)
    data = dom.getElementsByTagName("data")

    for i in range(len(data)):
        name = data[i].getAttribute("name")
        names.append(name)
        value = data[i].getElementsByTagName("value")
        values.append(value[0].firstChild.nodeValue.encode("utf-8"))

def writeToFile():
    with open("uiStrings-fi.py", "w") as f:
        for i in range(len(names)):
            line = names[i] + '="'+ values[i] + '"' #varName='varValue'
            f.write(line)
            f.write("\n")

getStrings("ResourceFile.fi-FI.resx")
writeToFile()

And here's the traceback:

Traceback (most recent call last):
  File "GenerateLanguageFiles.py", line 24, in <module>
    writeToFile()
  File "GenerateLanguageFiles.py", line 19, in writeToFile
    line = names[i] + '="'+ values[i] + '"' #varName='varValue'
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 2: ordinal not in range(128)

How should I fix my script so that it reads and writes UTF-8 characters properly? The files I'm trying to generate will be used in test automation with Robot Framework.

2 solutions

#1

You'll need to remove the call to encode() - that is, replace nodeValue.encode("utf-8") with nodeValue - and then change the call to open() to

with open("uiStrings-fi.py", "w", "utf-8") as f:

This uses a "Unicode-aware" version of open() which you will need to import from the codecs module, so also add

from codecs import open

to the top of the file.

The issue is that by calling nodeValue.encode("utf-8") you were converting a Unicode string (Python's internal representation, which can store any Unicode character) into a regular byte string (a sequence of single bytes, each in the range 0-255). Later on, when you construct the line to write to the output file, names[i] is still a Unicode string but values[i] is a byte string. Python tries to convert the byte string to Unicode, which is the more general type, but because you don't specify an explicit conversion it uses the default codec, ASCII, and ASCII can't decode bytes with values greater than 127. Unfortunately, several of those occur in values[i], because UTF-8 encodes every non-ASCII character using bytes in that upper range. So Python complains that it has found a byte it can't handle.

The solution, as I said above, is to defer the conversion from Unicode to bytes until the last possible moment, and you do that by using the Unicode-aware version of open (which handles the encoding for you).
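
To make the failing step concrete, here is a minimal sketch. It is written for Python 3, where mixing text and bytes is a TypeError, so it performs explicitly the ASCII decode that Python 2 does implicitly during concatenation. The helper name implicit_concat and the sample data are hypothetical, chosen only for illustration.

```python
# Python 3 sketch: reproduce explicitly the ASCII decode that Python 2
# performs implicitly when a unicode object is concatenated with a byte string.

name = u"lorem"                                  # text, as minidom returns it
value = u"ipsum \u00f6\u00e4".encode("utf-8")    # bytes, after .encode("utf-8")


def implicit_concat(text, raw):
    """Mimic Python 2's unicode + str: decode the bytes as ASCII first."""
    return text + u'="' + raw.decode("ascii") + u'"'


try:
    implicit_concat(name, value)
    error = None
except UnicodeDecodeError as exc:
    # The first byte of the UTF-8 encoding of a non-ASCII character is >= 0x80,
    # which the ASCII codec rejects.
    error = exc

print(error)
```

The byte position in the message depends on where the first non-ASCII character sits in the string, so it will not necessarily match the position shown in the question's traceback.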

Now that I think about it, an alternative to what I said above would be to replace names[i] with names[i].encode("utf-8"). That way you convert names[i] into a byte string as well, and Python has no reason to try to convert values[i] back to Unicode. That said, one could argue that it's good practice to keep your strings as Unicode objects until you write them out to the file... if nothing else, Unicode strings are the default in Python 3.
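
A rough sketch of this alternative, assuming Python 3 (where the pieces being joined must all be bytes and the file must be opened in binary mode). The function name write_encoded, the sample data, and the temporary output location are hypothetical.

```python
# Python 3 sketch of the alternative: make both operands byte strings.
import os
import tempfile

names = [u"lorem"]                                  # still text
values = [u"ipsum \u00f6\u00e4".encode("utf-8")]    # already encoded, as in the question


def write_encoded(path, names, values):
    # Binary mode: every piece is already UTF-8 bytes, so nothing is re-encoded
    # and no implicit ASCII conversion can happen.
    with open(path, "wb") as f:
        for name, value in zip(names, values):
            f.write(name.encode("utf-8") + b'="' + value + b'"\n')


out_path = os.path.join(tempfile.mkdtemp(), "uiStrings-fi.py")  # hypothetical location
write_encoded(out_path, names, values)

with open(out_path, "rb") as f:
    written = f.read()
```

Because the file is opened in binary mode, line endings are written exactly as given ("\n"), with no platform-specific translation.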

#2

The XML parser decodes the UTF-8 encoding of the input when it reads the file and all the text nodes and attributes of the resulting DOM are then unicode objects. When you select the interesting data from the DOM, you re-encode the values as UTF-8, but you don't encode the names. The resulting values array contains encoded byte strings while the names array still contains unicode objects.

In the line where the error is raised, Python tries to concatenate such a unicode name with a byte string value. To do so, both values have to be of the same type, so Python tries to convert the byte string values[i] to unicode. It doesn't know that the bytes are UTF-8 encoded, though, and fails when it tries to decode them with the ASCII codec.

The easiest way to work around this would be to keep all the strings as Unicode objects and just encode them to UTF-8 when they are written to the file:

values.append(value[0].firstChild.nodeValue) # keep as unicode, don't encode yet
...
f.write(line.encode('utf-8')) # encode only when writing to the file
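
Putting the two changes together, an end-to-end version might look roughly like the sketch below. It assumes Python 3, so the output file is opened in binary mode to accept the encoded bytes. It parses an inline stand-in for the .resx file rather than reading ResourceFile.fi-FI.resx from disk. The names RESX, get_strings and write_strings and the temporary output location are hypothetical.

```python
import io
import os
import tempfile
from xml.dom.minidom import parseString

# Inline stand-in for the .resx file from the question (UTF-8 encoded bytes).
RESX = u"""<?xml version="1.0" encoding="utf-8"?>
<root>
  <resheader name="foo"><value>bar</value></resheader>
  <data name="lorem" xml:space="preserve"><value>ipsum \u00f6\u00e4</value></data>
</root>""".encode("utf-8")


def get_strings(xml_bytes):
    """Return (name, value) pairs as text; nothing is encoded here."""
    dom = parseString(xml_bytes)
    pairs = []
    for data in dom.getElementsByTagName("data"):
        value = data.getElementsByTagName("value")[0]
        pairs.append((data.getAttribute("name"), value.firstChild.nodeValue))
    return pairs


def write_strings(path, pairs):
    # Build each line as text and encode it only at the moment of writing.
    with open(path, "wb") as f:
        for name, value in pairs:
            line = u'%s="%s"\n' % (name, value)
            f.write(line.encode("utf-8"))


pairs = get_strings(RESX)
out_path = os.path.join(tempfile.mkdtemp(), "uiStrings-fi.py")  # hypothetical location
write_strings(out_path, pairs)

# Read the result back as UTF-8 text to check the round trip.
with io.open(out_path, "r", encoding="utf-8") as f:
    content = f.read()
```

Returning the pairs from get_strings instead of appending to module-level lists is a design choice of this sketch, not something the answer requires.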
