UTF-8 strings in Python 2 and 3

Date: 2023-01-05 23:12:33

The following code works in Python 3:

people = [u'Nicholas Gyeney', u'Andr\xe9']
writers = ", ".join(people)
print(writers)
print("Writers: {}".format(writers))

And produces the following output:

Nicholas Gyeney, André  
Writers: Nicholas Gyeney, André

In Python 2.7, though, I get the following error:

Traceback (most recent call last):
  File "python", line 4, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 21: ordinal not in range(128)

I can fix this error by changing ", ".join(people) to ", ".join(people).encode('utf-8'), but if I do so, the output in Python 3 changes to:

b'Nicholas Gyeney, Andr\xc3\xa9'  
Writers: b'Nicholas Gyeney, Andr\xc3\xa9'

So I tried to use the following code:

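import sys

if sys.version_info < (3, 0):
    # Python 2 only: reload(sys) restores setdefaultencoding(), which site.py deletes at startup
    reload(sys)
    sys.setdefaultencoding('utf-8')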

people = [u'Nicholas Gyeney', u'Andr\xe9']
writers = ", ".join(people)
print(writers)
print("Writers: {}".format(writers))

This makes my code work in all versions of Python, but I have read that using setdefaultencoding is discouraged.

What's the best approach to deal with this issue?

4 solutions

#1


2  

First we assume that you want to support Python 2.7 and 3.5 versions (2.6 and 3.0 to 3.2 are handled a bit differently).

As you have already read, setdefaultencoding is discouraged and actually not needed in your case.

To write cross-platform code that deals with unicode text, you generally only need to specify the string encoding in a few places:

  1. At the top of your script, below the shebang, with # -*- coding: utf-8 -*- (only if your code contains string literals with non-ASCII text)
  2. When you read input data (e.g. from a text file or database)
  3. When you output data (again, to a text file or database)
  4. When you define a string literal in code
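
As a rough illustration of points 2 and 3, here is a minimal sketch that declares the encoding at the input and output boundaries using io.open, which exists in both Python 2.7 and 3.x. The helper names write_names and read_names and the file name people.txt are made up for this example and are not part of the question.

```python
# -*- coding: utf-8 -*-
import io
import os
import tempfile


def write_names(path, names):
    # Encode on output: io.open turns text into UTF-8 bytes when writing.
    with io.open(path, "w", encoding="utf-8") as handle:
        handle.write(u"\n".join(names))


def read_names(path):
    # Decode on input: io.open returns text (unicode on 2.7, str on 3.x).
    with io.open(path, "r", encoding="utf-8") as handle:
        return handle.read().split(u"\n")


people = [u"Nicholas Gyeney", u"Andr\xe9"]
# Hypothetical file name inside a fresh temporary directory.
path = os.path.join(tempfile.mkdtemp(), "people.txt")
write_names(path, people)
loaded = read_names(path)
print(loaded == people)
```

With the encoding stated at both boundaries, the code in between only ever handles text, so it does not depend on the interpreter's default encoding.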

Here is how I changed your example by following those rules:

#!/usr/bin/env python
# -*- coding: utf-8 -*-

people = ['Nicholas Gyeney', 'André']
writers = ", ".join(people)
print(writers)
print("Writers: {}".format(writers))

print(type(writers))
print(len(writers))

which, in Python 2.7, outputs the following for the last two print calls:

<type 'str'>
23

Here is what changed:

  • Specified the file encoding at the top of the file
  • Replaced \xe9 with the actual Unicode character (é)
  • Removed the u prefixes

It works nicely in Python 2.7.12 and 3.5.2.

But be warned that removing the u prefixes makes Python 2 use the regular str type instead of unicode (see the output of print(type(writers))). With UTF-8 this works in most places as if it were a unicode string, but checking the text length returns the wrong value. In this example len returns 23, whereas the actual number of characters is 22. This is because the underlying type is str, which counts each byte as a character, and the character é is encoded as two bytes in UTF-8.

In other words, this works fine when outputting data (as in your example), but not if you want to do string manipulation on the text. In that case, you still need to use the u prefix or explicitly convert the data to the unicode type before manipulating it.
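
The byte count versus character count difference can be reproduced on either Python version by encoding and decoding explicitly. This is only a small sketch; the variable names text, raw and decoded are chosen for the illustration.

```python
# Text with one non-ASCII character (e with acute accent, U+00E9).
text = u"Nicholas Gyeney, Andr\xe9"

# Encoding to UTF-8 gives a byte string in which that character takes two bytes.
raw = text.encode("utf-8")

print(len(text))  # 22 characters
print(len(raw))   # 23 bytes

# Decode back to text before doing any string manipulation.
decoded = raw.decode("utf-8")
print(decoded[-1] == u"\xe9")
print(decoded.upper() == u"NICHOLAS GYENEY, ANDR\xc9")
```

Indexing or upper-casing the byte string instead would operate on individual bytes, which is why the conversion to text has to happen first.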

So, outside of a simple example like yours, it would be better to keep using the u prefix. You need it in two places:

#!/usr/bin/env python
# -*- coding: utf-8 -*-

people = [u'Nicholas Gyeney', u'André']
writers = ", ".join(people)
print(writers)
print(u"Writers: {}".format(writers))

print(type(writers))
print(len(writers))

which outputs:

<type 'unicode'>
22

Note: the u prefix was removed in Python 3.0 and reintroduced in Python 3.3 for backward compatibility.

A detailed explanation of the intricacies of working with unicode text in Python 2 is available in the official documentation: Python 2 - Unicode HOWTO.

Here is an excerpt about the special comment that specifies the file encoding:

Python supports writing Unicode literals in any encoding, but you have to declare the encoding being used. This is done by including a special comment as either the first or second line of the source file:

#!/usr/bin/env python
# -*- coding: latin-1 -*-

u = u'abcdé'
print ord(u[-1])

The syntax is inspired by Emacs’s notation for specifying variables local to a file. Emacs supports many different variables, but Python only supports coding. The -*- symbols indicate to Emacs that the comment is special; they have no significance to Python but are a convention. Python looks for coding: name or coding=name in the comment.

If you don’t include such a comment, the default encoding used will be ASCII.

If you can get hold of the book "Learning Python, 5th Edition", I encourage you to read Chapter 37, "Unicode and Byte Strings", in Part VIII, Advanced Topics. It contains a detailed explanation of working with Unicode text in both generations of Python.

Another detail worth mentioning is that in Python 2, format always returns a str (byte string) if the format string is a str, regardless of whether the arguments are unicode, so non-ASCII arguments first have to be encoded with the default ASCII codec.

In contrast, old-style formatting with % returns a unicode string if any of the arguments is unicode. So instead of writing this:

print(u"Writers: {}".format(writers))

you could write this, which is not only shorter and prettier, but also works in both Python 2 and 3:

print("Writers: %s" % writers)
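
To check that the two spellings produce the same text, a short comparison like the following should run unchanged on Python 2.7 and 3.x. The variable names with_format and with_percent are only for illustration.

```python
people = [u"Nicholas Gyeney", u"Andr\xe9"]
writers = u", ".join(people)

# str.format with an explicitly unicode format string.
with_format = u"Writers: {}".format(writers)

# %-formatting with a plain literal: on Python 2 the result is promoted
# to unicode because the argument is unicode; on Python 3 it is str.
with_percent = "Writers: %s" % writers

print(with_format == with_percent)
```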

#2


3  

You could provide the Unicode prefix when formatting:

print(u"Writers: {}".format(writers))

This does deal with the issue, but you are littering your Python 3 script with unnecessary u'' prefixes.

You could also use from __future__ import unicode_literals after checking the version, but I wouldn't do that: it is generally trickier to work with and has been considered for deprecation, since the u'' prefix does the job sufficiently.

#3


2  

In Python 2 you should use unicode strings for join and print:

people = [u'Nicholas Gyeney', u'Andr\xe9']
writers = u", ".join(people)
print(writers)
print(u"Writers: {}".format(writers))

#4


0  

The answer is to make everything unicode:

# -*- coding: utf-8 -*-
people = [u'Nicholas Gyeney', u'André']
writers = u", ".join(people)
print(writers)
print(u"Writers: {}".format(writers))
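
When some values arrive as byte strings (for example from a file opened in binary mode), one way to keep everything unicode is to normalize them at the boundary. The sketch below assumes the incoming bytes are UTF-8; the names text_type and to_text are hypothetical helpers, not standard library functions.

```python
# -*- coding: utf-8 -*-

# unicode on Python 2, str on Python 3.
text_type = type(u"")


def to_text(value, encoding="utf-8"):
    """Return value as text, decoding byte strings with the given encoding."""
    if isinstance(value, bytes):
        # On Python 2, bytes is an alias for str, so plain literals land here.
        return value.decode(encoding)
    return text_type(value)


# One name is already text, the other arrives as UTF-8 encoded bytes.
people = [to_text(u"Nicholas Gyeney"), to_text(b"Andr\xc3\xa9")]
writers = u", ".join(people)
print(len(writers))
print(writers == u"Nicholas Gyeney, Andr\xe9")
```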
