Python unicode string literals in a module declared as utf-8

Date: 2023-01-05 23:26:23

I have a dummy Python module with the utf-8 header that looks like this:

# -*- coding: utf-8 -*-
a = "á"
print type(a), a

Which prints:

<type 'str'> á

But I thought that all string literals inside a Python module declared as utf-8 would automatically be of type unicode, instead of str. Am I missing something, or is this the correct behaviour?

To get a as a unicode string, I use:

a = u"á"

But this doesn't seem very "polite", nor practical. Is there a better option?


3 answers

#1


5  

No, the codec at the top only informs Python how to interpret the source code, and uses that codec to interpret Unicode literals. It does not turn literal bytestrings into unicode values. As PEP 263 states:


This PEP proposes to introduce a syntax to declare the encoding of a Python source file. The encoding information is then used by the Python parser to interpret the file using the given encoding. Most notably this enhances the interpretation of Unicode literals in the source code and makes it possible to write Unicode literals using e.g. UTF-8 directly in an Unicode aware editor.


Emphasis mine.

Without the codec declaration, Python has no idea how to interpret non-ASCII characters:


$ cat /tmp/test.py 
example = '☃'
$ python2.7 /tmp/test.py 
  File "/tmp/test.py", line 1
SyntaxError: Non-ASCII character '\xe2' in file /tmp/test.py on line 1, but no encoding declared; see http://www.python.org/peps/pep-0263.html for details

If Python behaved the way you expect it to, you would not be able to write literal bytestring values that contain non-ASCII byte values either.

If your terminal is configured to display UTF-8 values, then printing a UTF-8 encoded byte string will look 'correct', but only by virtue of luck that the encodings match.
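
To make the "luck" part concrete, here is a minimal sketch that decodes the same three bytes with a matching and with a mismatched codec. The variable names are made up for illustration, and the b'' and u'' prefixes are used so the snippet should behave the same under Python 2.7 and Python 3:

```python
# The three UTF-8 bytes that encode U+2603 SNOWMAN.
raw = b'\xe2\x98\x83'

# Decoding with the codec the bytes were written in gives one character.
as_utf8 = raw.decode('utf-8')

# Decoding with a different codec still "works", but yields three
# unrelated characters (mojibake). A terminal that assumes the wrong
# encoding shows the same kind of garbage.
as_latin1 = raw.decode('latin-1')

print(repr(as_utf8))
print(repr(as_latin1))
```

The bytes never change; only the codec used to interpret them does, which is why printing a byte string only looks right when the terminal happens to agree with the source file.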


The correct way to get unicode values is to use unicode literals, or to produce unicode some other way (decoding from byte strings, converting integer codepoints to unicode characters, etc.):

unicode_snowman = '\xe2\x98\x83'.decode('utf8')
unicode_snowman = unichr(0x2603)

In Python 3, the codec also applies to how variable names are interpreted, as you can use letters and digits outside of the ASCII range in names. The default codec in Python 3 is UTF-8, as opposed to ASCII in Python 2.
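
A small sketch of what that means in practice. This one is Python 3 only (Python 2 rejects these names), and the identifiers are arbitrary examples, not taken from any real code:

```python
# Python 3 only: identifiers may contain non-ASCII letters (PEP 3131).
# The file is read as UTF-8 by default, so no coding comment is needed.
größe = 3
área = größe ** 2

print(área)
```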


#2


6  

# -*- coding: utf-8 -*-

doesn't make string literals Unicode. Take this example: I have a UTF-8 encoded file with an Arabic comment and string:

# هذا تعليق عربي
print type('نص عربي')

If I run it, it throws a SyntaxError exception:

SyntaxError: Non-ASCII character '\xd9' in file file.py
on line 2, but no encoding declared;
see http://www.python.org/peps/pep-0263.html for details

So to allow this, I have to add this line to tell the interpreter that the file is UTF-8 encoded:

# -*-coding: utf-8 -*-

# هذا تعليق عربي
print type('نص عربي')

Now it runs fine, but it still prints <type 'str'> unless I make the string Unicode:

# -*-coding: utf-8 -*-

# هذا تعليق عربي
print type(u'نص عربي')
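
Why the distinction matters can be sketched like this: a unicode string counts characters, while the encoded byte string counts bytes. The names text and raw are illustrative only, and the snippet is written with u'' and print() so it should run under both Python 2.7 and Python 3:

```python
# -*- coding: utf-8 -*-

# A unicode (text) string: its length is the number of characters.
text = u'نص عربي'

# The same text as UTF-8 bytes, which is what a plain '...' literal
# holds in a Python 2 file saved as UTF-8. Each Arabic letter takes
# more than one byte, so the byte string is longer.
raw = text.encode('utf-8')

print(len(text))
print(len(raw))
print(raw.decode('utf-8') == text)
```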

#3


2  

No, this is just the source code encoding. Please see http://www.python.org/dev/peps/pep-0263/

To define a source code encoding, a magic comment must be placed into the source files either as first or second line in the file, such as:


      # coding=<encoding name>

or (using formats recognized by popular editors)

      #!/usr/bin/python
      # -*- coding: <encoding name> -*-

or

      #!/usr/bin/python
      # vim: set fileencoding=<encoding name> :
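
To check which encoding a given magic comment actually declares, Python 3's standard library exposes the detection logic as tokenize.detect_encoding. The following is a rough sketch for Python 3 only; the helper name declared_encoding is made up here, and note that the function reports normalized codec names rather than the exact spelling in the comment:

```python
import io
import tokenize


def declared_encoding(source_bytes):
    """Return the source encoding Python 3 would detect for these bytes."""
    readline = io.BytesIO(source_bytes).readline
    encoding, _lines_read = tokenize.detect_encoding(readline)
    return encoding


# Magic comment on the first line.
print(declared_encoding(b"# coding=latin-1\nx = 1\n"))

# Magic comment on the second line, after a shebang.
print(declared_encoding(b"#!/usr/bin/python\n# -*- coding: utf-8 -*-\n"))

# The vim-style form is matched as well.
print(declared_encoding(b"#!/usr/bin/python\n# vim: set fileencoding=latin-1 :\n"))

# No declaration at all: Python 3 falls back to UTF-8.
print(declared_encoding(b"x = 1\n"))
```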

The magic comment doesn't make all literals unicode; it only specifies how unicode literals should be decoded.

Use the unicode() function or the u prefix to make a literal unicode.

N.B. In Python 3, all strings are unicode.
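
If prefixing every literal with u feels impractical, one option not mentioned above is the unicode_literals future import, available since Python 2.6, which gives a module the Python 3 behaviour for unprefixed literals. A minimal sketch, assuming the file is saved as UTF-8 (the variable names are illustrative):

```python
# -*- coding: utf-8 -*-
# The __future__ import must come before any other statement in the module.
from __future__ import unicode_literals

# With the import, an unprefixed literal is a text (unicode) string.
a = "á"

# Byte strings now need an explicit b prefix.
raw = b"\xc3\xa1"

print(type(a))
print(type(raw))
```

Under Python 3 the import is accepted and has no effect, so the same file runs on both versions. The trade-off is that every unprefixed literal in the module changes type, which can surprise code that expects byte strings.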

