Python encoding issues: Unicode, UTF-8, encode(), decode()

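① http://www.cnblogs.com/huxi/archive/2010/12/05/1897271.html This article is well written and worth reading. The key points:

The most common conversion is between unicode and UTF-8.

In Python 2, calling encode("utf-8") on a unicode object returns a str object (UTF-8 encoded).

Calling decode("utf-8") on a str object (usually UTF-8 encoded) returns a unicode object.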
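
The two conversions above can be sketched in Python 3 terms, where Python 2's unicode type is called str and Python 2's byte-oriented str is called bytes. This is only an illustration; the variable names are made up for the example.

```python
# Python 3 sketch: text (str) <-> UTF-8 bytes round trip.
# In Python 2 the same roles are played by unicode and str respectively.
text = "编码"                      # a Unicode string (Python 2: u"编码")

encoded = text.encode("utf-8")     # text -> bytes
decoded = encoded.decode("utf-8")  # bytes -> text

# Each of these two characters occupies 3 bytes in UTF-8.
print(type(encoded).__name__, len(encoded))
print(decoded == text)
```

Decoding with the same codec that was used for encoding always restores the original text.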

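② http://appofis.iteye.com/blog/443206 This article points out:

It would seem that the encode() method of unicode objects and the decode() method of str objects are all we need.

However, unicode objects also have a decode() method (a unicode object is already the result of decoding a str, yet it can still be decoded), and str objects also have an encode() method (a str is already the result of encoding a unicode object, yet it can still be encoded).

So what happens when they are used this way?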

(1) str.encode(e) is the same as unicode(str).encode(e).
This is useful since code that expects Unicode strings should also work when it is passed
ASCII-encoded 8-bit strings (from Guido van Rossum)
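Roughly, Python's creator is saying that encode() is meant to be called on unicode objects, but if it is accidentally called on a str object, and that str happens to be ASCII-encoded (the ASCII range is identical in Unicode), the call should still succeed. That is one use of str.encode(), though in my view it is of little practical value.
Similarly, calling decode() on a unicode object made up only of ASCII characters works for the same reason: nearly every encoding leaves ASCII unchanged, so the operation is effectively a no-op.
u"abc".decode("gb2312") and u"abc" are equal.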
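
As a rough illustration of the two points above, the following Python 3 sketch spells out the implicit step that Python 2 performs inside str.encode(e): the byte string is first decoded with the default codec (normally ASCII) and only then encoded. The helper name py2_style_encode is hypothetical and only imitates the behaviour described in the quote; it is not a real library function.

```python
# Python 3 sketch imitating what Python 2 does for str.encode(e):
# decode the byte string with the default ASCII codec first,
# then encode the resulting text with the requested codec.
def py2_style_encode(raw: bytes, encoding: str) -> bytes:
    """Hypothetical helper: emulate Python 2's str.encode(encoding)."""
    return raw.decode("ascii").encode(encoding)

# Pure-ASCII input survives unchanged, because GB2312 and UTF-8
# both encode ASCII characters as the same single bytes.
ascii_result = py2_style_encode(b"abc", "gb2312")
same_bytes = "abc".encode("gb2312") == "abc".encode("utf-8") == b"abc"

# Non-ASCII input fails at the implicit ASCII decode step.
try:
    py2_style_encode("中文".encode("utf-8"), "gb2312")
    failed = False
except UnicodeDecodeError:
    failed = True

print(ascii_result, same_bytes, failed)
```

This is why the shortcut only helps for ASCII-only data: anything else raises UnicodeDecodeError before the requested encoding is even attempted.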

# -*- coding: utf-8 -*-

# python2.x
if __name__ == "__main__":
    # Decoding an ASCII-only unicode object with gb2312 changes nothing
    print type(u"abc")
    print type(u"abc".decode("gb2312"))
    print repr(u"abc")
    print repr(u"abc".decode("gb2312"))
    # The same holds for utf-8
    print type(u"abc")
    print type(u"abc".decode("utf-8"))
    print repr(u"abc")
    print repr(u"abc".decode("utf-8"))

Running this under Python 2 prints:

<type 'unicode'>
<type 'unicode'>
u'abc'
u'abc'
<type 'unicode'>
<type 'unicode'>
u'abc'
u'abc'

Running the same calls under Python 3 (with print written as a function) raises an error:

<class 'str'>
Traceback (most recent call last):
  File "G:/Workspace/JetBrains/untitled/test.py", line 4, in <module>
    print (type(u"abc".decode("gb2312")))
AttributeError: 'str' object has no attribute 'decode'

Process finished with exit code 1

This is because Python 3.x treats 'xxx' and u'xxx' as the same Unicode string type, so the u prefix makes no difference, whereas a string in byte form must carry the b prefix: b'xxx'.
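
A small Python 3 check of the statement above, using only built-in types; the variable names are illustrative.

```python
# Python 3: the u prefix is accepted but changes nothing.
same_literal = u"abc" == "abc" and type(u"abc") is str

# Text (str) can only be encoded; bytes can only be decoded.
str_has_decode = hasattr("abc", "decode")
bytes_has_encode = hasattr(b"abc", "encode")

# A byte string needs the b prefix and never compares equal to text.
bytes_equals_text = b"abc" == "abc"

print(same_literal, str_has_decode, bytes_has_encode, bytes_equals_text)
```

Only the first value is true, which matches the AttributeError shown earlier: str has no decode() in Python 3.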

# -*- coding: utf-8 -*-

# python3.x

if __name__ == "__main__":
    print(type(u"abc"))
    print("abc".encode("utf-8"))
    print(repr("abc".encode("utf-8")))
    print(type("abc".encode("utf-8")))

Running this under Python 3 prints:

<class 'str'>
b'abc'
b'abc'
<class 'bytes'>

(2) http://blog.****.net/haichao062/article/details/8107316

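About decode('unicode-escape')

I have used this before when parsing crawler data from Jiayuan (世纪佳缘).

Given a string of Unicode escape sequences such as '\u53eb\u6211', decoding it yields the corresponding Chinese characters.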

# -*- coding: utf-8 -*-
if __name__ == "__main__":
    # A Python 2 str literal without the u prefix keeps \uXXXX as literal text
    u = "\u53eb\u6211"
    print(u.decode('unicode-escape'))
    print(type(u.decode('unicode-escape')))

Running this under Python 2 prints:

叫我
<type 'unicode'>

PS: More on Python's unicode-escape:

http://*.com/questions/2969044/python-string-escape-vs-unicode-escape
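
In Python 3, str has no decode(), so the unicode-escape trick needs a bytes object first. A minimal sketch, assuming the input contains only ASCII characters and \uXXXX escapes as in the example above; the variable names are illustrative.

```python
# Python 3 sketch: turn literal \uXXXX escape text into real characters.
escaped = r"\u53eb\u6211"   # 12 ASCII characters: backslash, 'u', four hex digits, twice

# Encode the escape text to bytes, then decode it with the unicode_escape codec.
# Assumption: the text is pure ASCII; non-ASCII input would make encode("ascii") fail.
decoded = escaped.encode("ascii").decode("unicode_escape")

print(len(escaped), len(decoded))
print(decoded == "\u53eb\u6211")
```

The raw-string prefix r keeps the backslashes literal, which mimics escape text read from a file or an HTTP response.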
