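I recently revisited and practised Python encoding and decoding for a data-mining task. My notes are summarised below; if I have missed anything, corrections are welcome.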

1. A brief history of character encoding

Please see the article below. It is very accessible and a rare good read:

The Story of Character Encoding and Decoding (the differences between ASCII, ANSI, Unicode, and UTF-8)

2. Encoding and decoding in Python

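In Python 2, a str object is a byte string, comparable to byte[] in Java. str objects are what a program uses for input and output, that is, for interacting with the user.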

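For example, ASCII is sufficient for English but not for Chinese, which needs an encoding such as GBK or GB2312. ASCII, GBK, and GB2312 are all encodings whose output is stored in a str object.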
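As a small illustration of the point above, here is a sketch written for Python 3 (an assumption on my part, since the post itself uses Python 2), where str plays the role of Python 2's unicode and bytes plays the role of Python 2's str. It encodes the same two Chinese characters with GBK and UTF-8 and shows that ASCII cannot represent them.

```python
# Python 3 sketch: text lives in str, encoded data lives in bytes.
text = "示例"  # two Chinese characters

gbk_bytes = text.encode("gbk")     # GBK uses 2 bytes for each of these characters
utf8_bytes = text.encode("utf-8")  # UTF-8 uses 3 bytes for each of these characters

print(len(text), len(gbk_bytes), len(utf8_bytes))  # 2 4 6

# ASCII only covers code points 0-127, so encoding Chinese text fails.
ascii_failed = False
try:
    text.encode("ascii")
except UnicodeEncodeError as exc:
    ascii_failed = True
    print("ASCII cannot encode this text:", exc.reason)
```

The character count stays at 2 throughout; only the number of bytes changes with the encoding.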

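Internally, Python 2 represents text as unicode objects, comparable to String in Java. Note that Python's internal unicode representation differs slightly from the Unicode standard itself; we can ignore those differences for now.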

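In Python 3, every str is already Unicode text: it can only be encoded into a byte string (bytes) in some encoding, and it has no decode method. Decoding is done on bytes objects instead.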
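The Python 3 split between str and bytes can be checked directly. The following is a minimal sketch, assuming a Python 3 interpreter; the variable names are only for illustration.

```python
# Python 3 sketch: str can only be encoded, bytes can only be decoded.
text = "ab"
data = text.encode("utf-8")  # str -> bytes

print(type(text).__name__, type(data).__name__)  # str bytes

# str has no decode method and bytes has no encode method.
str_has_decode = hasattr(text, "decode")
bytes_has_encode = hasattr(data, "encode")
print(str_has_decode, bytes_has_encode)  # False False

# Decoding the bytes with the same encoding restores the original text.
round_trip = data.decode("utf-8")
print(round_trip == text)  # True
```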

2.1 Modules commonly used for encoding and decoding in Python:

codecs and chardet

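2.2 Methods commonly used for encoding and decoding in Python:

myStr.decode               : decodes myStr into unicode; the argument names the encoding myStr is currently in.
myUnicode.encode           : encodes myUnicode into a str object; the argument names the target encoding.
isinstance(s, str)         : tests whether s is an ordinary (byte) string
isinstance(s, unicode)     : tests whether s is unicode

codecs.encode(obj[, encoding[, errors]]) : with no encoding argument, the default encoding (ASCII in Python 2) is used. For text encodings, obj should be a unicode object.
codecs.decode(obj[, encoding[, errors]]) : with no encoding argument, the default encoding (ASCII in Python 2) is used.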
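To make the codecs functions listed above concrete, here is a short sketch written for Python 3 (an assumption; the post targets Python 2). In Python 3 the default encoding of codecs.encode and codecs.decode is utf-8 rather than ASCII, so the sketch always passes the encoding explicitly. It also shows how the optional errors argument changes what happens to characters the target encoding cannot represent.

```python
import codecs

text = "示例"

# Function-style equivalents of text.encode("gbk") and data.decode("gbk").
gbk_bytes = codecs.encode(text, "gbk")
restored = codecs.decode(gbk_bytes, "gbk")
print(restored == text)  # True

# The errors argument controls what happens to unencodable characters.
replaced = codecs.encode(text, "ascii", "replace")          # one '?' per character
ignored = codecs.encode(text, "ascii", "ignore")            # characters are dropped
escaped = codecs.encode(text, "ascii", "backslashreplace")  # \uXXXX escapes

print(replaced)  # b'??'
print(ignored)   # b''
print(escaped)   # b'\\u793a\\u4f8b'
```

With the default errors value ("strict"), the same ASCII encode raises UnicodeEncodeError instead.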

2.3 Usage examples:

>>> import chardet
>>> myStr = "ab"
>>> print len(myStr)
2
>>> print chardet.detect(myStr)
{'confidence': 1.0, 'encoding': 'ascii'}
>>> myStr = u"ab"
>>> myUEStr = u"ab"
>>> print len(myUEStr)
2
>>> print chardet.detect(myUEStr)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "build\bdist.win32\egg\chardet\__init__.py", line 25, in detect
ValueError: Expected a bytes object, not a unicode object
Cause: chardet.detect() does not accept a unicode object as input.
>>> myCnStr = "示例"
>>> print len(myCnStr)
4
>>> print chardet.detect(myCnStr)
{'confidence': 0.5475, 'encoding': 'windows-1252'}
>>> myUCStr = u"示例"
>>> print len(myUCStr)
2
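Analysis: len() on a unicode object counts characters, not bytes, so the result is 2. The length of 4 seen earlier for myCnStr is the number of bytes, because the console's GBK encoding uses 2 bytes for each of these Chinese characters (see section 1 of this article, on the history of character encoding).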
>>> print myCnStr.decode('windows-1252')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
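UnicodeEncodeError: 'gbk' codec can't encode character u'\xca' in position 0: illegal multibyte sequence
Cause: at this point I assumed the only problem was that the Windows cmd window uses the GBK codec by default (right-click the cmd window ==> Properties ==> Options ==> Current code page).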
>>> print myCnStr.decode('windows-1252').encode('gb18030')
????????
>>> print myCnStr.decode('windows-1252').encode('gbk')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
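UnicodeEncodeError: 'gbk' codec can't encode character u'\xca' in position 0: illegal multibyte sequence
Analysis: these two errors puzzled me. Decoding myCnStr to unicode and then encoding it as GBK seemed reasonable, so why does GBK reject it? The decode with 'windows-1252' does succeed, but it produces the wrong characters (such as u'\xca'), and GBK cannot represent them. The chardet.detect(myCnStr) guess, with its low confidence value, was simply unreliable.
Decoding with gb18030 then worked. Decoding with gbk or gb2312 works as well.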
>>> print myCnStr.decode('gb18030')
示例
Analysis: the unicode string decoded from myCnStr prints correctly in the cmd window, presumably because print encodes it with the console's GBK codec.
>>> print myCnStr.decode('gb18030').encode('gbk')
示例
>>> print len(myCnStr.decode('gb18030'))
2
>>> print len(myCnStr.decode('gb18030').encode('gbk'))
4
Analysis: the length after GBK encoding is 4, which again shows that a GBK-encoded string is a byte string.
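The failure in the session above can be reproduced without chardet or a Windows console. The following is a sketch for Python 3 (an assumption; the session itself is Python 2) that starts from the GBK bytes of the same two characters, decodes them with the wrong codec, and then with the right one.

```python
# Python 3 sketch of the wrong-codec problem from the session above.
raw = "示例".encode("gbk")  # the bytes a GBK console would hand to Python 2

# Decoding with the wrong codec succeeds but yields unrelated Latin characters.
wrong = raw.decode("windows-1252")
print(ascii(wrong))  # '\xca\xbe\xc0\xfd'

# Those characters are not in GBK, so encoding them back to GBK fails.
gbk_rejected = False
try:
    wrong.encode("gbk")
except UnicodeEncodeError:
    gbk_rejected = True
print(gbk_rejected)  # True

# Decoding with the codec that produced the bytes restores the text.
right = raw.decode("gbk")
print(len(right), len(right.encode("gbk")))  # 2 4
```

The decode step never raises here because windows-1252 maps each of these four bytes to some character; the mistake only becomes visible when the result is encoded or displayed.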

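3. Accessing MySQL from Python

(Verified to work on Windows 7.)

Please treat the following with caution, and corrections from helpful readers are welcome. I still have some doubts about my MySQL configuration and usage.

Module: MySQLdb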

*******************************************my-default.ini*******************************************

======================> (the lines I added were shown in blue in the original post)

# For advice on how to change settings please see
# http://dev.mysql.com/doc/refman/5.6/en/server-configuration-defaults.html
# *** DO NOT EDIT THIS FILE. It's a template which will be copied to the
# *** default location during install, and will be replaced if you
# *** upgrade to a newer version of MySQL.

[mysqld]

# Remove leading # and set to the amount of RAM for the most important data
# cache in MySQL. Start at 70% of total RAM for dedicated server, else 10%.
# innodb_buffer_pool_size = 128M

# Remove leading # to turn on a very important data integrity option: logging
# changes to the binary log between backups.
# log_bin

# These are commonly set, remove the # and set as required.
basedir = H:\Software\mysql-5.6.21-win32\mysql-5.6.21-win32
datadir = H:\\Software\mysql-5.6.21-win32\mysql-5.6.21-win32\data
# port = .....
# server_id = .....

# Remove leading # to set options mainly useful for reporting servers.
# The server defaults are faster for transactions and fast SELECTs.
# Adjust sizes as needed, experiment to find the optimal values.
# join_buffer_size = 128M
# sort_buffer_size = 2M
# read_rnd_buffer_size = 2M

sql_mode=NO_ENGINE_SUBSTITUTION,STRICT_TRANS_TABLES
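# default-character-set is not accepted under [mysqld] since MySQL 5.5; character_set_server below is the server-side setting
# default-character-set = utf8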
character_set_server = utf8

[mysqld_safe]
default-character-set = utf8

[client]
default-character-set = utf8

*****************************************************************************************************

**************************************Python code example**********************************************

#-*- encoding=utf-8 -*- #======> Python only looks for "coding" followed by : or = and the encoding name; the -*- markers are optional
import MySQLdb
import chardet

# The following makes Python use utf-8 as its default encoding
import sys
reload(sys)
sys.setdefaultencoding('utf-8')

... # unrelated code omitted
try:
   myConn = MySQLdb.connect(host='localhost',user='root',passwd='rootPWD',db='myDB',port=3306, charset='gbk')
   myCur = myConn.cursor()
   mysqlCmd = "insert into myTable values(%s, %s)"
   myParam = (strName, float(value))
   # execute() fills in the %s placeholders from myParam
   n = myCur.execute(mysqlCmd, myParam)
   myConn.commit()
   myCur.close()
   myConn.close()
except MySQLdb.Error, e:
   print "Mysql Error %d: %s" % (e.args[0], e.args[1])

*****************************************************************************************************
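The comment on the first line of the script says that Python only looks at the encoding name in the declaration. This can be checked with the pattern given in PEP 263. The sketch below targets Python 3, and the helper name declared_encoding is hypothetical, introduced only for this illustration.

```python
import re

# Pattern from PEP 263 for the source-encoding declaration (line 1 or 2 of a file).
CODING_RE = re.compile(r"^[ \t\f]*#.*?coding[:=][ \t]*([-_.a-zA-Z0-9]+)")

def declared_encoding(line):
    """Return the encoding named in a source line, or None (hypothetical helper)."""
    match = CODING_RE.match(line)
    return match.group(1) if match else None

print(declared_encoding("#-*- encoding=utf-8 -*-"))  # utf-8
print(declared_encoding("# coding: gbk"))            # gbk
print(declared_encoding("# just a comment"))         # None
```

The form used in the script matches because "encoding=" contains "coding=", which is all the pattern requires.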

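A few notes:

(1) If Chinese fields show up as garbled text when you read the table inside the mysql client with select * from myTable, run set names gbk and then try the select again.

(2) MySQL supports character sets at four levels: server, database, table, and column.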
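To see which character set is in effect at each of the four levels (server, database, table, column), one option is to run a standard MySQL statement per level. The sketch below only builds the statements as strings; the helper name charset_inspection_queries is hypothetical, and the names myDB and myTable are taken from the example above. Passing each statement to a cursor's execute() is left to the reader.

```python
def charset_inspection_queries(database, table):
    """Return (level, SQL) pairs for inspecting character sets (hypothetical helper).

    Identifiers cannot be passed as query parameters, so they are
    interpolated here; only use trusted database and table names.
    """
    return [
        ("server", "SHOW VARIABLES LIKE 'character_set_server'"),
        ("database", "SHOW CREATE DATABASE `%s`" % database),
        ("table", "SHOW CREATE TABLE `%s`" % table),
        ("column", "SHOW FULL COLUMNS FROM `%s`" % table),
    ]

queries = charset_inspection_queries("myDB", "myTable")
for level, sql in queries:
    print("%-8s %s" % (level, sql))
```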
