Why does Python 2.x throw an exception with string formatting + unicode?

I have the following piece of code. The last line throws an error. Why is that?

class Foo(object):

    def __unicode__(self):
        return u'\u6797\u89ba\u6c11\u8b1d\u51b0\u5fc3\u6545\u5c45'

    def __str__(self):
        return self.__unicode__().encode('utf-8')

print "this works %s" % (u'asdf')
print "this works %s" % (Foo(),)
print "this works %s %s" % (Foo(), 'asdf')
print

print "this also works {0} {1}".format(Foo(), u'asdf')
print
print "this should break %s %s" % (Foo(), u'asdf')

The error is "UnicodeDecodeError: 'ascii' codec can't decode byte 0xe6 in position 18: ordinal not in range(128)"

1 Solution

#1

When you mix unicode and byte string objects, Python 2 implicitly converts between them: it either encodes the unicode values to byte strings or decodes the byte strings to unicode, using the default ASCII codec.

You are mixing unicode, byte strings and a custom object, and you are triggering a sequence of encodings and decodings that don't work together.

In this case, your Foo() value is interpolated as a string (str(Foo()) is used), and the u'asdf' interpolation triggers a decode of the template so far (so with the UTF-8 Foo() value) to interpolate the unicode string. This decode fails as the ASCII codec cannot decode the \xe6\x9e\x97 UTF-8 byte sequence already interpolated.
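
The failing step can be reproduced by hand. The following is only a sketch of what happens internally, not the interpreter's actual code path: it builds the partially interpolated byte string and then performs the ASCII decode explicitly. The names `text`, `partial` and `error` are made up for illustration.

```python
# -*- coding: utf-8 -*-
# Sketch: perform explicitly the ASCII decode that Python 2 does implicitly.
text = u'\u6797\u89ba\u6c11\u8b1d\u51b0\u5fc3\u6545\u5c45'

# What the byte string template looks like once str(Foo()) has been interpolated.
partial = b"this should break " + text.encode('utf-8')

error = None
try:
    # Interpolating u'asdf' forces this decode with the default ASCII codec.
    partial.decode('ascii')
except UnicodeDecodeError as exc:
    error = exc

# Index of the first byte that ASCII cannot decode.
print(error.start)  # 18
```

The index 18 matches the position reported in the traceback: the 18 bytes of "this should break " decode fine, and the first UTF-8 byte of the Foo() value does not.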

You should always explicitly encode Unicode values to bytestrings or decode byte strings to Unicode before mixing types, as the corner cases are complex.
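
One way to follow this advice is to normalize every value to unicode before formatting. The helper below is a minimal sketch, not part of any library; the name `to_text` and the choice of UTF-8 as the default encoding are assumptions you would adapt to your own data.

```python
# -*- coding: utf-8 -*-
def to_text(value, encoding='utf-8'):
    """Hypothetical helper: return value as unicode, decoding byte strings explicitly."""
    if isinstance(value, bytes):  # bytes is an alias for str in Python 2
        return value.decode(encoding)
    return value

raw = b'\xe6\x9e\x97'         # UTF-8 bytes for u'\u6797'
template = u"name: %s %s"     # unicode template, so the result is unicode too
result = template % (to_text(raw), u'asdf')
```

Because every argument is already unicode when it reaches the template, no implicit ASCII decode is needed.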

Explicitly converting to unicode() works:

>>> print "this should break %s %s" % (unicode(Foo()), u'asdf')
this should break 林覺民謝冰心故居 asdf

as the output is turned into a unicode string:

>>> "this should break %s %s" % (unicode(Foo()), u'asdf')
u'this should break \u6797\u89ba\u6c11\u8b1d\u51b0\u5fc3\u6545\u5c45 asdf'

while otherwise you'd end up with a byte string:

>>> "this should break %s %s" % (Foo(), 'asdf')
'this should break \xe6\x9e\x97\xe8\xa6\xba\xe6\xb0\x91\xe8\xac\x9d\xe5\x86\xb0\xe5\xbf\x83\xe6\x95\x85\xe5\xb1\x85 asdf'

(note that 'asdf' is left as a byte string too).

Alternatively, use a unicode template:

>>> u"this should break %s %s" % (Foo(), u'asdf')
u'this should break \u6797\u89ba\u6c11\u8b1d\u51b0\u5fc3\u6545\u5c45 asdf'
