Python Web Crawlers (Part 1)

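Often we want data from a website that offers no API for it. What then? At other times we need to simulate human actions, such as clicking a button on a page. Is there a good way to do that? These are exactly the situations where Python and web crawlers come in. Python is a dynamic, interpreted language; its simple syntax and strong library support have made it widely used in data collection, data analysis, web page analysis, scientific computing, and many other fields.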

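This post summarizes how to write a simple crawler yourself in Python, along with the problems you may run into and how to solve them.

First, the general idea: the program connects to the website and sends a GET request to fetch the HTML file, then parses the HTML and extracts the needed information according to some pattern, and finally processes the data in a suitable way and saves it.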
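To make those three steps concrete, here is a minimal sketch of the pipeline with the network and the parser swapped out for stand-ins. The names `crawl`, `fetch`, `extract`, and `fake_pages` are hypothetical and not from any library, and the sketch is written for Python 3 so it can be run on its own (the rest of this post uses Python 2).

```python
import csv
import io


def crawl(urls, fetch, extract, out):
    """Fetch each URL, extract rows from its HTML, and write them as CSV.

    fetch(url) returns the page text, or None on failure.
    extract(html) returns a list of rows (each row a list of strings).
    Both are hypothetical callables supplied by the caller.
    """
    writer = csv.writer(out)
    written = 0
    for url in urls:
        html = fetch(url)            # step 1: GET the page
        if html is None:             # skip pages that could not be fetched
            continue
        for row in extract(html):    # step 2: parse and extract
            writer.writerow([url] + list(row))  # step 3: save
            written += 1
    return written


# Stand-ins for a real network fetch and a real HTML parser.
fake_pages = {"http://example.com/a": "1,2", "http://example.com/b": None}
out = io.StringIO()
count = crawl(sorted(fake_pages), fake_pages.get,
              lambda html: [html.split(",")], out)
```

Passing `fetch` and `extract` in as arguments is only one way to arrange it; the point is that fetching, parsing, and saving can each be tested separately.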

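(1) urllib2: Python's library for HTTP connections

Here is the link to the Python documentation for anyone who wants to read it: https://docs.python.org/2/library/urllib2.html?highlight=urllib2#module-urllib2

This library is Python's main tool for opening URLs (mostly HTTP), and it also supports authentication, redirects, cookies, and more. For the details, read the documentation carefully. Here I cover only the simplest case: opening a URL and getting the corresponding HTML page. It takes a single function: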

　　urllib2.urlopen(url[, data[, timeout[, cafile[, capath[, cadefault[, context]]]]])

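1. The url argument can be either a string or a Request object. If you supply data, the function sends a POST request; otherwise it sends a GET. Note that data must be a string in application/x-www-form-urlencoded format (for example, the result of passing a dict to urllib.urlencode()), not the dict itself. Also, many people claim there is no way to set a connection timeout and then offer a pile of workarounds such as changing the global socket timeout, but you can simply set it right here with the timeout argument!

2. The return value. The function returns a file-like object, which means you can treat it like a file; it just has three extra methods:

- geturl() returns the URL of the page that was retrieved (after any redirects)
- info() returns the page's meta-information, such as headers
- getcode() returns the HTTP status code sent by the server, e.g. 200 for OK or 404 for Not Found.

3. Exceptions raised. URLError is raised for problems such as having no network connection. socket.timeout is raised when you set the timeout argument and the time limit is exceeded.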

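So the code can be written like this:

 import urllib2
 import socket

 def gethtml(url):
     """Return the page body, or None if the request failed."""
     data = None
     try:
         f = urllib2.urlopen(url, timeout=10)
         data = f.read()
     except socket.timeout as e:
         print "time out!"
         # remember which URLs timed out so they can be retried later
         with open("timeout", 'a') as log:
             log.write(url + '\n')
     except urllib2.URLError as ee:
         print "url error"
     return data

This gives you the HTML source of the page at that URL (in real use, though, you may hit some annoying problems; the ones I ran into are listed below).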
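As a side note on the data argument described earlier: whether a request goes out as GET or POST depends only on whether a body is attached. Here is a minimal sketch that builds both kinds of request without touching the network. `build_request` is a hypothetical helper and the example.com URLs are placeholders; the imports fall back to the Python 2 module names used in this post.

```python
try:  # Python 3
    from urllib.parse import urlencode
    from urllib.request import Request
except ImportError:  # Python 2, as used in the rest of this post
    from urllib import urlencode
    from urllib2 import Request


def build_request(url, fields=None):
    """Return a Request that is a POST if fields is given, else a GET."""
    if fields is None:
        return Request(url)
    # urlencode turns the dict into "key=value&key2=value2"
    body = urlencode(fields).encode("ascii")
    return Request(url, data=body)


get_req = build_request("http://example.com/search")
post_req = build_request("http://example.com/login", {"user": "holmes"})
```

Either object can then be passed to urlopen in place of a plain URL string.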

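(2) Parsing the HTML

Once we have the page source, we need to extract from it the URLs to crawl next or the data we want, which means parsing the HTML. In Python 2.7 the built-in parser class is HTMLParser; library documentation: https://docs.python.org/2/library/htmlparser.html?highlight=htmlparser

In fact, all you have to do is subclass it and implement its handler methods. It looks like this: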

 from HTMLParser import HTMLParser

 # create a subclass and override the handler methods
 class MyHTMLParser(HTMLParser):
     def handle_starttag(self, tag, attrs):
         print "Encountered a start tag:", tag

     def handle_endtag(self, tag):
         print "Encountered an end tag :", tag

     def handle_data(self, data):
         print "Encountered some data  :", data

For example, take a very simple HTML file:

 <html>
     <head>
         <title>Test</title>
     </head>
     <body>
         <h1>Parse me!</h1>
     </body>
 </html>

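The parser calls handle_starttag each time it meets the start of a tag, handle_endtag each time it meets the end of a tag, and handle_data when it meets the data between tags. The sample code at the end of this post should help make this clearer. To parse the file, just write:

# for example, reusing the gethtml function above
def parse():
    html = gethtml("http://baidu.com")
    if html is None:  # gethtml returns None on failure
        return
    parser = MyHTMLParser()
    parser.feed(html)
    parser.close()

Calling feed hands the HTML to the parser, which processes as much of it as forms complete elements and buffers the rest; calling close forces the parser to process whatever is still buffered.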
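Since one of the goals mentioned above is to pull out the URLs to crawl next, here is a minimal sketch of a parser that does only that: it collects the href of every a tag. `LinkCollector` is a hypothetical class name and the HTML string is made up for the example; the import falls back to the Python 2 module name used in this post.

```python
try:  # Python 3
    from html.parser import HTMLParser
except ImportError:  # Python 2
    from HTMLParser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href attribute of every <a> tag that has one."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        # attrs is a list of (name, value) pairs
        for name, value in attrs:
            if name == "href" and value:
                self.links.append(value)


collector = LinkCollector()
collector.feed('<p><a href="/page/2">next</a> <a name="top">top</a> '
               '<a href="http://example.com/">home</a></p>')
collector.close()
```

After the call, `collector.links` holds the two href values in document order; the anchor without an href is skipped.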

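(3) Problems and solutions

1. The HTML you get is garbled

This can have many causes; the most common is that the page's encoding does not match the decoder you use. For background on encoding and decoding, I recommend this blog post: http://www.liaoxuefeng.com/wiki/001374738125095c955c1e6d8bb493182103fac9270762a000/001386819196283586a37629844456ca7e5a7faa9b94ee8000

After reading it you will see that encoding problems really can make the file unreadable. So you need to find out the page's actual encoding and call the matching decoder. The code might look like this: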

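f = urllib2.urlopen(url, timeout=10)
data = f.read()
# decode the html
contentType = (f.headers.get('Content-Type') or '').lower()
# str.find() returns -1 (which is truthy) when nothing is found, so test with "in"
if "gbk" in contentType:
    data = unicode(data, "GBK").encode("utf-8")
elif "utf-8" in contentType:
    pass

Of course, there are many possible encodings, not just GBK.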
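To handle more than one encoding without a growing if/elif chain, one option is to read the charset parameter out of the Content-Type header and pass it straight to the decoder. This is a minimal sketch under that assumption: `charset_from_content_type` and `to_text` are hypothetical helpers, the UTF-8 default is my choice, and pages that declare their encoding only in a meta tag are not covered.

```python
def charset_from_content_type(content_type, default="utf-8"):
    """Return the charset named in a Content-Type header, lowercased."""
    if not content_type:
        return default
    # e.g. "text/html; charset=GBK" -> ["text/html", " charset=GBK"]
    for part in content_type.split(";")[1:]:
        name, _, value = part.strip().partition("=")
        if name.lower() == "charset" and value:
            return value.strip("\"' ").lower()
    return default


def to_text(raw_bytes, content_type):
    """Decode a response body using the charset from its header."""
    charset = charset_from_content_type(content_type)
    # "replace" keeps going on bad bytes instead of raising
    return raw_bytes.decode(charset, "replace")


page = u"\u4e2d\u6587".encode("gbk")   # stand-in for a GBK response body
text = to_text(page, "text/html; charset=GBK")
```

If the header names a charset Python does not know, decode raises LookupError, so a real crawler would want to catch that and fall back to a default.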

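2. Still garbled

If the text is still garbled, you may still have the wrong decoder, or the page may have been compressed. In that case decompress the page first and then decode it. The decompression code might look like this: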

 import StringIO, gzip

 f = urllib2.urlopen(url, timeout=10)
 # consider some html is compressed by server with gzip
 isGzip = 'gzip' in (f.headers.get('Content-Encoding') or '').lower()
 if isGzip:
     compressedData = f.read()
     # GzipFile needs a file-like object, so wrap the bytes in a StringIO
     compressedStream = StringIO.StringIO(compressedData)
     gzipper = gzip.GzipFile(fileobj=compressedStream)
     data = gzipper.read()
 else:
     data = f.read()
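The decompression step can be tried without a server by compressing a string in memory first. This is a minimal sketch: `maybe_gunzip` is a hypothetical helper that takes the body and the Content-Encoding value as plain arguments, and it uses io.BytesIO, which is available in both Python 2 and Python 3.

```python
import gzip
import io


def maybe_gunzip(body, content_encoding):
    """Decompress body if the header says gzip, else return it unchanged."""
    if content_encoding and "gzip" in content_encoding.lower():
        return gzip.GzipFile(fileobj=io.BytesIO(body)).read()
    return body


# Build a gzip-compressed body in memory to stand in for a server response.
buf = io.BytesIO()
with gzip.GzipFile(fileobj=buf, mode="wb") as gz:
    gz.write(b"<html>hello</html>")
compressed = buf.getvalue()

plain = maybe_gunzip(compressed, "gzip")
untouched = maybe_gunzip(b"<html>hello</html>", None)
```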

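3. The program hangs after calling urlopen

The main cause is not setting the timeout argument: when the network is bad, urlopen waits indefinitely and the program freezes. Set it, for example timeout=10 (the unit is seconds), and handle the socket.timeout exception it raises, and the problem goes away.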
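Once a timeout is set, one thing you can do with the exception is retry a few times before giving up. Retrying is my own addition here, not something the crawler in this post does. In this minimal sketch, `fetch_with_retry` is a hypothetical helper and `flaky_open` is a fake opener that simulates two timeouts, so nothing touches the network.

```python
import socket


def fetch_with_retry(open_url, url, retries=3, timeout=10):
    """Call open_url(url, timeout=...) up to `retries` times.

    Returns the first successful result, or None if every attempt timed out.
    """
    for attempt in range(retries):
        try:
            return open_url(url, timeout=timeout)
        except socket.timeout:
            continue  # try again
    return None


calls = []


def flaky_open(url, timeout):
    """Fake opener: times out twice, then succeeds."""
    calls.append(url)
    if len(calls) < 3:
        raise socket.timeout("simulated timeout")
    return "<html>ok</html>"


result = fetch_with_retry(flaky_open, "http://example.com/")
```

In real use, `open_url` would be a small wrapper around urlopen that also reads the response body.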

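4. Getting blacklisted by the server

If you hit a domain too fast, the server may treat you as a potential DDoS attack and ban your IP for a while. The solution is to slow down, for example with a sleep call (its argument is in seconds).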

import time

for url in urllist:
    f = urllib2.urlopen(url,timeout=10)
    ...
    time.sleep(1)

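(4) A small crawler

I wrote a small crawler to collect the admission score lines and related data over the years for all undergraduate universities on www.gaokao.com. The schoollist file stores the ID numbers of all the universities; you can substitute a hand-written list, for example:

schoollist = ['','','']

Here is the code, pasted in full: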

#!/usr/bin/env python
# -*- coding: utf-8 -*-
__author__ = 'holmes'

from HTMLParser import HTMLParser
import urllib2
import StringIO, gzip
import threading
import os
import time
import sys
import socket

# hard code
LOCATION = ("北京", "天津", "辽宁", "吉林", "黑龙江", "上海", "江苏", "浙江", "安徽", "福建", "山东", "湖北",
            "湖南", "广东", "重庆", "四川", "陕西", "甘肃", "河北", "山西", "内蒙古", "河南", "海南", "广西",
            "贵州", "云南", "*", "青海", "宁夏", "*", "江西",)

# hard code
SUBJECT = ("理科", "文科",)

'''Rules for URL
http://college.gaokao.com/school/tinfo/%d/result/%d/%d/ %(schoolID,localID,subID)
where localID from 1 to 31
where subID from 1 to 2
'''
SEED = "http://college.gaokao.com/school/tinfo/%s/result/%s/%s/"
SID = "schoolID"  # file name contains school IDs

class SpiderParser(HTMLParser):
    def __init__(self, subject=1, location=1):
        HTMLParser.__init__(self)
        self.campus_name = ""
        self.subject = SUBJECT[subject - 1]
        self.location = LOCATION[location - 1]
        self.table_content = [[], ]  # one list per table row
        self.line_no = 0
        # flags recording which tags the parser is currently inside
        self.__in_h2 = False
        self.__in_table = False
        self.__in_td = False

    def handle_starttag(self, tag, attrs):
        if tag == "h2":
            self.__in_h2 = True
        if tag == "table":
            self.__in_table = True
        if tag == "tr" and len(attrs) != 0:
            if self.__in_table:
                # every data row starts with school name, subject, province
                self.table_content[self.line_no].append(self.campus_name)
                self.table_content[self.line_no].append(self.subject)
                self.table_content[self.line_no].append(self.location)
        if tag == "td":
            if self.__in_table:
                self.__in_td = True

    def handle_endtag(self, tag):
        if tag == "h2":
            self.__in_h2 = False
        if tag == "table":
            self.__in_table = False
        if tag == "tr":
            if self.__in_table:
                # row finished: move on to a fresh row
                self.line_no += 1
                self.table_content.append([])
        if tag == "td":
            if self.__in_table:
                self.__in_td = False

    def handle_data(self, data):
        if self.__in_h2:
            self.campus_name = data
        if self.__in_td:
            self.table_content[self.line_no].append(data)

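def getschoolID():
    with open(SID, mode='r') as rf:
        # strip newlines so the IDs can be compared and put into URLs directly
        idlist = [line.strip() for line in rf if line.strip()]
        print idlist
        return idlist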

def gethtml(url):
    """Return the page as UTF-8 text, or None on timeout.

    urllib2.URLError is left to propagate so the caller can log it.
    """
    data = None
    try:
        f = urllib2.urlopen(url, timeout=10)
        # consider some html is compressed by server with gzip
        isGzip = 'gzip' in (f.headers.get('Content-Encoding') or '').lower()
        if isGzip:
            compressedData = f.read()
            compressedStream = StringIO.StringIO(compressedData)
            gzipper = gzip.GzipFile(fileobj=compressedStream)
            data = gzipper.read()
        else:
            data = f.read()
        # decode the html
        contentType = (f.headers.get('Content-Type') or '').lower()
        # str.find() returns -1 (truthy) when nothing is found, so test with "in"
        if "gbk" in contentType:
            data = unicode(data, "GBK").encode("utf-8")
        elif "utf-8" in contentType:
            pass
    except socket.timeout as e:
        data = None
        print "time out!"
        with open("timeout", 'a') as log:
            log.write(url + '\n')
    return data

def parseandwrite((slID, lID, sID), wfile):
    try:
        url = SEED % (slID, lID, sID)
        html = gethtml(url)
        if html is None:
            print "pass a timeout"
            return
        parser = SpiderParser(sID, lID)
        parser.feed(html)
        parser.close()
        if parser.line_no != 0:
            for line in parser.table_content:
                for item in line:
                    # the site shows "--" for missing values
                    if "--" in item:
                        item = "NULL"
                    wfile.write(item + ',')
                wfile.write('\n')
    except urllib2.URLError as e:
        print "url error in parseandwrite()"
        raise

def thread_task(idlist, name):
    print "thread %s is start" % name
    wf = open(name, mode='w')
    try:
        # end the CSV header with a newline so the first data row starts on its own line
        wf.write("大学名称,文理,省份,年份,最低分,最高分,平均分,录取人数,录取批次\n")
        for index, sID in enumerate(idlist):
            print name + ":%s" % index
            sID = sID.strip('\n')
            i = 1.0
            for localID in range(1, 32):
                for subID in range(1, 3):
                    parseandwrite((sID, localID, subID), wf)
                    # 31 provinces x 2 subjects = 62 pages per school
                    sys.stdout.write("\rprocess:%.2f%%" % (i / 62.0 * 100))
                    sys.stdout.flush()
                    i += 1.0
                    time.sleep(1)
    except urllib2.URLError:
        # record where we stopped so the crawl can be resumed later
        with open("errorlog_" + name, 'w') as f:
            f.write("schoolID is %s , locationID is %s ,subID is %s" % (sID, localID, subID))
        print "schoolID is %s ,locationID is %s ,subID is %s" % (sID, localID, subID)
    finally:
        wf.close()

THREAD_NO = 1

def master():
    school = getschoolID()
    threads = []
    chunk = len(school) // THREAD_NO
    for i in range(THREAD_NO):
        name = "thread" + str(i)
        # thread_task() writes its log as "errorlog_" + name
        path = os.path.join(os.path.curdir, "errorlog_" + name)
        if os.path.exists(path):
            # resume from the school recorded in the error log
            with open(path, 'r') as rf:
                sID = rf.readline().split()[2]
            start = [s.strip() for s in school].index(sID)
        else:
            start = chunk * i
        # slice ends are exclusive; the last thread takes the remainder
        end = len(school) if i == THREAD_NO - 1 else chunk * (i + 1)
        t = threading.Thread(target=thread_task, args=(school[start:end], name,))
        t.start()
        threads.append(t)
        print "start:%s \n end:%s" % (start, end)
    # wait for every worker, not just the last one started
    for t in threads:
        t.join()
    print "finish"

if __name__ == '__main__':
    # thread_task(["1"],"test")
    master()
    # gethtml("http://www.baidu.com")
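The crawler divides the school list among THREAD_NO worker threads. Slice bounds are easy to get wrong, so here is a minimal sketch of that partitioning on its own. `split_evenly` is a hypothetical helper, not part of the crawler, and the IDs are made up.

```python
def split_evenly(items, parts):
    """Split items into `parts` consecutive chunks; the last takes the remainder."""
    size = len(items) // parts
    chunks = []
    for i in range(parts):
        start = size * i
        # slice ends are exclusive, so the last chunk must end at len(items)
        end = len(items) if i == parts - 1 else size * (i + 1)
        chunks.append(items[start:end])
    return chunks


chunks = split_evenly(["1", "2", "3", "4", "5", "6", "7"], 3)
```

Every ID lands in exactly one chunk, which is the property to check whenever the thread count or list length changes.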
