A pyppeteer crawler example

If you run this on CentOS, install the following dependencies first:

yum install pango.x86_64 libXcomposite.x86_64 libXcursor.x86_64 libXdamage.x86_64 libXext.x86_64 libXi.x86_64 libXtst.x86_64 cups-libs.x86_64 libXScrnSaver.x86_64 libXrandr.x86_64 GConf2.x86_64 alsa-lib.x86_64 atk.x86_64 gtk3.x86_64 -y

Run the code

import asyncio
from collections import namedtuple

import pyppeteer

Response = namedtuple("rs", "title url html cookies headers history status")


async def get_html(url, timeout=30):
    # timeout is given in seconds (default 30); pyppeteer expects milliseconds
    browser = await pyppeteer.launch(headless=True, args=['--no-sandbox'])
    try:
        page = await browser.newPage()
        res = await page.goto(url, options={'timeout': int(timeout * 1000)})
        data = await page.content()
        title = await page.title()
        resp_cookies = await page.cookies()
        resp_headers = res.headers
        resp_history = None  # redirect history is not collected in this example
        resp_status = res.status
        return Response(title=title, url=url,
                        html=data,
                        cookies=resp_cookies,
                        headers=resp_headers,
                        history=resp_history,
                        status=resp_status)
    finally:
        # always close the browser so no Chromium process is left running
        await browser.close()


if __name__ == '__main__':
    url_list = ["http://www.10086.cn/index/tj/index_220_220.html", "http://www.10010.com/net5/011/",
                "http://python.jobbole.com/87541/"]
    # one coroutine per URL; gather runs them concurrently
    tasks = (get_html(url) for url in url_list)
    loop = asyncio.get_event_loop()
    results = loop.run_until_complete(asyncio.gather(*tasks))
    for res in results:
        print(res.title)
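
The script above starts one Chromium instance per URL and runs them all at once, which may become heavy with a longer URL list. One possible way to cap the number of pages fetched at the same time is an asyncio.Semaphore. The following is only a minimal sketch: fetch_all and fake_fetch are hypothetical names that do not appear in the original, and fake_fetch is a stand-in for get_html so that the snippet runs without a browser.

```python
import asyncio


async def fetch_all(urls, fetch, limit=2):
    """Run fetch(url) for every URL, with at most `limit` running at once.

    `fetch` is any coroutine function taking a URL, e.g. get_html.
    """
    semaphore = asyncio.Semaphore(limit)

    async def guarded(url):
        # wait for a free slot before starting the fetch
        async with semaphore:
            return await fetch(url)

    # gather returns results in the same order as `urls`
    return await asyncio.gather(*(guarded(url) for url in urls))


async def fake_fetch(url):
    # hypothetical stand-in for get_html: sleeps instead of opening a page
    await asyncio.sleep(0.01)
    return "fetched " + url


if __name__ == "__main__":
    for line in asyncio.run(fetch_all(["u1", "u2", "u3"], fake_fetch, limit=2)):
        print(line)
```

To use it with the crawler, pass get_html in place of fake_fetch, for example fetch_all(url_list, get_html, limit=2). The limit value of 2 is arbitrary and would need tuning for the machine at hand.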
