Python 2.7.3: fixing garbled text in pages fetched with urllib2.urlopen

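The garbled text appears because of a bug on the web server: it always uses one fixed encoding instead of sending the page in the encoding requested by the client's request headers.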
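
To make the symptom concrete, here is a minimal sketch of how bytes decoded with the wrong charset turn into garbled text. The choice of GBK as the server's encoding and latin-1 as the wrong guess is an assumption made purely for illustration; it does not come from any particular site.

```python
# -*- coding: utf-8 -*-
# Illustrative only: the charsets below are assumptions, not taken from a real server.
original = u"你好"

# Assume the server sends the page as GBK no matter what the client asked for.
raw_bytes = original.encode("gbk")

# Decoding with the wrong charset yields garbled text
# (latin-1 maps every byte to some character, so it never raises).
garbled = raw_bytes.decode("latin-1")

# Decoding with the charset the server really used recovers the text.
recovered = raw_bytes.decode("gbk")

print(repr(garbled))
print(recovered == original)
```

The bytes themselves are intact in both cases; only the charset used to interpret them differs, which is why guessing the right charset is enough to repair the page.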

Solution: use chardet to guess the page's encoding.
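
As a rough sketch of the idea: chardet.detect returns a dict with an "encoding" entry and a "confidence" entry, and "encoding" can be None when chardet cannot decide. The helper below takes such a dict and decodes the bytes defensively. The function name decode_with_guess, the 0.5 threshold, and the UTF-8 fallback are hypothetical choices for illustration, not part of chardet.

```python
# -*- coding: utf-8 -*-
def decode_with_guess(raw, guess, min_confidence=0.5, fallback="utf-8"):
    """Decode raw bytes using a chardet-style guess dict (hypothetical helper).

    `guess` is expected to look like the result of chardet.detect(raw),
    for example {"encoding": "GB2312", "confidence": 0.99}.
    """
    encoding = guess.get("encoding")
    confidence = guess.get("confidence") or 0.0
    # "encoding" is None when no guess could be made; also distrust weak guesses.
    if not encoding or confidence < min_confidence:
        encoding = fallback  # assumed default, adjust for your pages
    # "replace" substitutes undecodable bytes instead of raising an exception.
    return raw.decode(encoding, "replace")
```

With chardet installed, this would be called as decode_with_guess(htmlCode, chardet.detect(htmlCode)). Taking the guess as a parameter keeps the helper free of any dependency on chardet itself.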

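1. Download the chardet Python source package from the chardet website.

2. Extract the chardet directory from the source package into your project folder.

3. Import it with import chardet, then: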

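    # -*- coding: utf-8 -*-
    import sys
    import urllib2

    import chardet


    # Wrapper function (the name is illustrative) so the return statements are valid.
    def fetch_html(url="http://www.baidu.com"):
        # Try to download the page.
        try:
            response = urllib2.urlopen(url)
        except Exception as e:
            print "Error: failed to download the page: " + str(e)
            return None

        if response.code != 200:
            print "Error: expected status code [200] but got [" + str(response.code) + "]"
            return None
        if response.msg != "OK":
            print "Error: expected status message [OK] but got [" + response.msg + "]"
            return None

        # Read the raw HTML from the response stream.
        try:
            htmlCode = response.read()
        except Exception as e:
            print "Error: failed to read the HTML from the response stream: " + str(e)
            return None

        # Handle the page encoding.
        try:
            # Guess the encoding.
            htmlCharsetGuess = chardet.detect(htmlCode)
            htmlCharsetEncoding = htmlCharsetGuess["encoding"]
            # Decode the raw bytes to unicode.
            htmlCode_decode = htmlCode.decode(htmlCharsetEncoding)
            # Get the system encoding (it can be None on some systems, so fall back to UTF-8).
            currentSystemEncoding = sys.getfilesystemencoding() or "utf-8"
            # Re-encode with the system encoding so the result can be processed in Python, e.g.:
            #     key = "你好"
            #     text = "xxxx你好yyyy"
            #     keyPos = text.find(key)
            # Without this re-encoding step, code like the above may raise an error.
            htmlCode_encode = htmlCode_decode.encode(currentSystemEncoding)
        except Exception as e:
            print "Error: failed to process the page encoding: " + str(e)
            return None

        # htmlCode_encode is the result.
        return htmlCode_encode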
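
The comment about re-encoding can be illustrated with a short sketch. Searching decoded text for a non-ASCII byte string fails, so both sides of find have to be the same type. The variable names and the use of UTF-8 here are assumptions for illustration; the listing above uses the system encoding, which only matches a key written in your source file when the two encodings are the same.

```python
# -*- coding: utf-8 -*-
page_text = u"xxxx你好yyyy"            # decoded text
key_bytes = u"你好".encode("utf-8")    # an encoded byte string

# Mixing decoded text with non-ASCII bytes fails:
# UnicodeDecodeError on Python 2 (implicit ASCII decode), TypeError on Python 3.
try:
    page_text.find(key_bytes)
    mixed_search_failed = False
except (UnicodeDecodeError, TypeError):
    mixed_search_failed = True

# Option 1: keep everything decoded and search with a decoded key.
pos_in_text = page_text.find(key_bytes.decode("utf-8"))

# Option 2: encode both sides the same way and search the bytes,
# which is the approach the re-encoding step relies on.
pos_in_bytes = page_text.encode("utf-8").find(key_bytes)

print(mixed_search_failed)
```

Either option works as long as the page and the search key agree. Keeping everything decoded avoids depending on what the system encoding happens to be.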

