Python 2.7 decoding error with a UTF-8 source header: UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3

Traceback:

Traceback (most recent call last):
  File "venues.py", line 22, in <module>
    main()
  File "venues.py", line 19, in main
    print_category(category, 0)
  File "venues.py", line 13, in print_category
    print_category(subcategory, ident+1)
  File "venues.py", line 10, in print_category
    print u'%s: %s' % (category['name'].encode('utf-8'), category['id'])
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 3: ordinal not in range(128)

Code:

# -*- coding: utf-8 -*-

# Using https://github.com/marcelcaraciolo/foursquare
import foursquare 

# Prints categories and subcategories
def print_category(category, ident):
    for i in range(0,ident):
        print u'\t',
    print u'%s: %s' % (category['name'].encode('utf-8'), category['id'])

    for subcategory in category.get('categories', []):
        print_category(subcategory, ident+1)

def main():
    client = foursquare.Foursquare(client_id='id',
                                   client_secret='secret')
    for category in client.venues.categories()['categories']:
        print_category(category, 0)

if __name__ == '__main__':
    main()

2 solutions

#1

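The trick is to keep all string processing in your program entirely in Unicode. Decode to Unicode when reading input (files, pipes, console) and encode when writing output. If category['name'] is already Unicode, keep it that way (remove the .encode('utf-8') call). With the call in place, the u'%s: %s' format string forces Python 2 to convert the encoded byte string back to Unicode using the default 'ascii' codec, which fails on the first non-ASCII byte (0xc3 here).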
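
As a rough illustration of "Unicode inside, encode at the edge", here is a minimal sketch. The names format_category and write_categories and the sample data are hypothetical (they are not part of the question's code or of the foursquare library), and the sketch assumes the API returns category names as Unicode strings.

```python
# -*- coding: utf-8 -*-
import io


def format_category(category, indent=0):
    """Return one category and its subcategories as a list of Unicode lines."""
    # No .encode() here: the name stays Unicode while it is being formatted.
    lines = [u'\t' * indent + u'%s: %s' % (category['name'], category['id'])]
    for subcategory in category.get('categories', []):
        lines.extend(format_category(subcategory, indent + 1))
    return lines


def write_categories(categories, byte_stream, encoding='utf-8'):
    """Encode only at the output boundary; byte_stream must accept bytes."""
    for category in categories:
        for line in format_category(category):
            byte_stream.write((line + u'\n').encode(encoding))


# Hypothetical sample data standing in for client.venues.categories().
sample = [{'name': u'Café', 'id': u'1',
           'categories': [{'name': u'Crêperie', 'id': u'2'}]}]

buf = io.BytesIO()
write_categories(sample, buf)
print(repr(buf.getvalue()))
```

The formatting code never sees bytes, so there is nothing for Python to decode implicitly; the single encode call at the end is the only place an encoding has to be chosen.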

Also, per your comment:

However, the error still occurs when I try to do: python venues.py > categories.txt, but not when output goes to the terminal: python venues.py

Python can usually determine the terminal encoding and will automatically encode to that encoding, which is why writing to the terminal works. If you use shell redirection to output to a file, you need to tell Python the I/O encoding you want via an environment variable, for example:

set PYTHONIOENCODING=utf8
python venues.py > categories.txt
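
If setting an environment variable is not convenient, a different technique (not the one used in this answer) is to wrap the byte stream in an encoding writer from the standard codecs module. The sketch below is only an illustration; unicode_writer and stdout_encoding are hypothetical helper names, and an in-memory stream stands in for the redirected file.

```python
# -*- coding: utf-8 -*-
import codecs
import io
import sys


def unicode_writer(byte_stream, encoding='utf-8'):
    """Wrap a byte stream so that Unicode text written to it is encoded."""
    return codecs.getwriter(encoding)(byte_stream)


def stdout_encoding(default='utf-8'):
    """Return the detected stdout encoding, or default when there is none."""
    # Assumption: a missing or None encoding means output is not a terminal.
    return getattr(sys.stdout, 'encoding', None) or default


# Demonstration against an in-memory byte stream instead of a real file.
buf = io.BytesIO()
out = unicode_writer(buf, 'utf-8')
out.write(u'üéâ\n')
print(repr(buf.getvalue()))
```

In Python 2.7, where sys.stdout is a byte stream, the same wrapper is commonly applied as sys.stdout = unicode_writer(sys.stdout, stdout_encoding()), so that print of a Unicode string works whether or not output is redirected.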

Here is a working example, using my US Windows console, which uses cp437 encoding. The source code is saved as "UTF-8 without BOM". It's worth pointing out that the source code bytes are UTF-8, but declaring the source encoding and using a Unicode string allows Python to decode the source correctly and to encode the print output automatically for the terminal using the terminal's default encoding:

#coding:utf8
import sys
print sys.stdout.encoding
print u'üéâäàåçêëèïîì'

When output goes to the terminal, Python uses the terminal's encoding. When output is redirected, Python does not know what the encoding is (sys.stdout.encoding is None, as out.txt shows), so it defaults to ascii:

C:\>python example.py
cp437
üéâäàåçêëèïîì

C:\>python example.py >out.txt
Traceback (most recent call last):
  File "example.py", line 4, in <module>
    print u'├╝├⌐├ó├ñ├á├Ñ├º├¬├½├¿├»├«├¼'
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-12: ordinal not in range(128)

C:\>type out.txt
None

Since we're using shell redirection, use a shell variable to tell Python what encoding to use:

C:\>set PYTHONIOENCODING=cp437

C:\>python example.py >out.txt

C:\>type out.txt
cp437
üéâäàåçêëèïîì

We can also force Python to use another encoding, but in this case the terminal doesn't know how to display UTF-8. The terminal is still decoding the bytes in the file using cp437:

C:\>set PYTHONIOENCODING=utf8

C:\>python example.py >out.txt

C:\>type out.txt
utf8
├╝├⌐├ó├ñ├á├Ñ├º├¬├½├¿├»├«├¼
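
The garbled line is not data corruption; it is the same UTF-8 bytes viewed through the cp437 code page. A small sketch of that round trip, using only the standard codecs (the variable names are just for illustration):

```python
# -*- coding: utf-8 -*-
text = u'üéâäàåçêëèïîì'

# What Python writes to out.txt when PYTHONIOENCODING=utf8.
utf8_bytes = text.encode('utf-8')

# How a cp437 console interprets those same bytes when typing the file.
shown = utf8_bytes.decode('cp437')

# Reversing the misinterpretation recovers the original text unchanged.
restored = shown.encode('cp437').decode('utf-8')

print(len(utf8_bytes))
```

Each of the 13 characters becomes two bytes in UTF-8, and every pair starts with 0xc3, which cp437 draws as ├. That is also the byte named in the question's error message.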

#2

I'm not sure, but I think the culprit is the "u" prefix at the start of u"%s: %s". This assumes that what you want to print is a byte string and not a unicode string, which would be reasonable (*): you output bytes, suitably encoded. Modified like this:

print '%s: %s' % (category['name'].encode('utf-8'), category['id'])

This would turn the unicode string category['name'] into a UTF-8 byte string, and the rest of the processing is then done with byte strings.
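
To see why mixing the two string types fails, the sketch below spells out the ASCII decode that Python 2 performs implicitly when a byte string is formatted into a unicode format string. The category name is a hypothetical value chosen for illustration, and the explicit decode call is a stand-in for the implicit conversion.

```python
# -*- coding: utf-8 -*-
name = u'Café'                    # hypothetical category name
encoded = name.encode('utf-8')    # byte string ending in 0xc3 0xa9

# Python 2 does this decode behind the scenes for u'%s' % encoded;
# it is written out explicitly here so the failure is visible.
try:
    encoded.decode('ascii')
    error = None
except UnicodeDecodeError as exc:
    error = exc

print(error)

# Consistent types avoid the implicit decode entirely:
all_bytes = encoded + b': ' + b'123'       # bytes only, as in this answer
all_unicode = u'%s: %s' % (name, u'123')   # unicode only, as in answer #1
```

The failing byte is the first byte of the encoded "é", at index 3 of the byte string, which is the same kind of position report seen in the question's traceback.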

(*) It is reasonable from one point of view; another point of view is to print unicode strings and let the environment decide how they should be encoded, but then you're at the mercy of several factors that you don't really control. That's why you see differences between output going to the terminal and output going to a file. To avoid all these issues, just print byte strings.
