pyspider示例代码一：利用phantomjs解决js问题

本系列文章主要记录和讲解pyspider的示例代码，希望能抛砖引玉。pyspider示例代码官方网站是http://demo.pyspider.org/。上面的示例代码太多，无从下手。因此本人找出一下比较经典的示例进行简单讲解，希望对新手有一些帮助。

示例说明：

如果页面中部分数据或文字由js生成，pyspider不能直接提取页面的数据。pyspider获取页面的代码，但是其中的js代码phantomjs，解决js代码执行问题。

使用方法：

方法一：在self.crawl函数中添加fetch_type="js"调用phantomjs执行js代码。

方法二：为函数添加参数@config(fetch_type="js")。

示例代码：

1、www.sciencedirect.com网站示例

#!/usr/bin/env python

# -*- encoding: utf- -*-

# vim: set et sw= ts= sts= ff=unix fenc=utf8:

# Created on -- ::

import re

from libs.base_handler import *

class Handler(BaseHandler):

    '''

    this is a sample handler

    '''

    crawl_config = {

        "headers": {

            "User-Agent": "BaiDu_Spider",

        },

        "timeout":,

        "connect_timeout":

    }

    def on_start(self):

        self.crawl('http://www.sciencedirect.com/science/article/pii/S1568494612005741',timeout=,connect_timeout=,

                   callback=self.detail_page)

        self.crawl('http://www.sciencedirect.com/science/article/pii/S0167739X12000581',timeout=,connect_timeout=,

                   age=, callback=self.detail_page)

        self.crawl('http://www.sciencedirect.com/science/journal/09659978',timeout=,connect_timeout=,

                   age=, callback=self.index_page)

    @config(fetch_type="js")

    def index_page(self, response):

        for each in response.doc('a').items():

            url=each.attr.href

            #print(url)

            if url!=None:

                if re.match('http://www.sciencedirect.com/science/article/pii/\w+$', url):

                    self.crawl(url, callback=self.detail_page,timeout=,connect_timeout=)

    @config(fetch_type="js")

    def detail_page(self, response):

        self.index_page(response)

        self.crawl(response.doc('#relArtList > li > .cLink').attr.href, callback=self.index_page,timeout=,connect_timeout=)

        return {

                "url": response.url,

                "title": response.doc('.svTitle').text(),

                "authors": [x.text() for x in response.doc('.authorName').items()],

                "abstract": response.doc('.svAbstract > p').text(),

                "keywords": [x.text() for x in response.doc('.keyword span').items()],

                }

秒客网

pyspider示例代码一：利用phantomjs解决js问题

示例说明：

使用方法：

示例代码：

相关文章