Python: Producer-Consumer

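Worth keeping around for reference.

The goal is for crawling (production) and parsing (consumption) of web pages to run at the same time in separate processes. This can serve as a starting point to extend in later projects.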

import re
import time
import multiprocessing as mp
from multiprocessing import Queue
# from multiprocessing import JoinableQueue as Queue

import requests
from bs4 import BeautifulSoup

base_url = 'https://morvanzhou.github.io/'

def crawl(url):
    html = requests.get(url).text
    # Simulate 0.1 s of extra request latency
    time.sleep(0.1)
    return html

def parse(html):
    soup = BeautifulSoup(html, 'lxml')
    all_anchors = soup.find_all('a', {'href': re.compile(r'^/.+?/$')})
    # title = soup.find('meta', {'property': 'og:title'})
    # Each href already starts with '/', so drop base_url's trailing slash to avoid '//'
    page_urls = {anchor.get_text().strip(): base_url.rstrip('/') + anchor['href'] for anchor in all_anchors}
    main_url = soup.find('meta', {'property': 'og:url'})['content']
    return main_url, page_urls

# print(html)

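def main():
    # unseen could hold more than one start URL
    unseen = (base_url,)
    seen = ()

    # To let HTML crawling and HTML parsing run side by side,
    # a producer-consumer queue sits between the two stages
    html_queue = Queue()

    # Start the process pools
    # Producer: HTML crawling
    crawl_pool = mp.Pool(2)
    # Consumer: HTML parsing
    parse_pool = mp.Pool(2)

    for url in unseen:
        # Keep going as long as there are pages left to crawl.
        # Note: .get() blocks until this page has been fetched.
        html_queue.put(crawl_pool.apply_async(crawl, args=(url,)).get())
    else:
        # All pages have been crawled
        html_queue.put(None)  # Signal that production is done; otherwise the loop below blocks forever

    results = []
    # Consume the crawled HTML and parse it
    while True:
        html = html_queue.get()
        if html:
            results.append(parse_pool.apply_async(parse, args=(html,)).get())
        else:
            # None is the end-of-production signal
            # html_queue.task_done()
            break

    print(results)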

if __name__ == '__main__':

    main()
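
In the listing above, every `apply_async(...)` call is followed immediately by `.get()`, which blocks until that result is ready, and the parsing loop only starts once the crawl loop has finished. As a rough sketch of how the two stages could actually overlap, the example below runs one producer process and one consumer process connected by a `multiprocessing.Queue`. To stay runnable without network access or third-party packages it makes several assumptions that are not in the original: `FAKE_PAGES`, `fake_crawl`, `extract_links`, `producer`, `consumer`, and `run_pipeline` are hypothetical names, an in-memory dict stands in for `requests.get`, and a regular expression stands in for BeautifulSoup.

```python
import multiprocessing as mp
import re
import time

# Hypothetical in-memory "site" used instead of real HTTP requests.
FAKE_PAGES = {
    "/": '<a href="/a/">A</a> <a href="/b/">B</a>',
    "/a/": '<a href="/">Home</a>',
    "/b/": '<a href="/a/">A</a>',
}


def fake_crawl(path):
    """Stand-in for crawl(): sleep to simulate latency, then return the HTML."""
    time.sleep(0.1)
    return FAKE_PAGES[path]


def extract_links(html):
    """Stand-in for parse(): collect the site-relative hrefs found in the HTML."""
    return sorted(set(re.findall(r'href="(/[^"]*)"', html)))


def producer(paths, html_queue):
    """Crawl each path and hand the HTML to the consumer as soon as it is ready."""
    for path in paths:
        html_queue.put((path, fake_crawl(path)))
    html_queue.put(None)  # sentinel: nothing more will be produced


def consumer(html_queue, result_queue):
    """Parse pages as they arrive until the sentinel shows up."""
    while True:
        item = html_queue.get()
        if item is None:
            break
        path, html = item
        result_queue.put((path, extract_links(html)))
    result_queue.put(None)  # sentinel: nothing more will be parsed


def run_pipeline(paths):
    html_queue, result_queue = mp.Queue(), mp.Queue()
    workers = [
        mp.Process(target=producer, args=(paths, html_queue)),
        mp.Process(target=consumer, args=(html_queue, result_queue)),
    ]
    for worker in workers:
        worker.start()

    # Drain the results before joining so no worker is left holding queued data.
    results = {}
    while True:
        item = result_queue.get()
        if item is None:
            break
        path, links = item
        results[path] = links

    for worker in workers:
        worker.join()
    return results


if __name__ == "__main__":
    print(run_pipeline(list(FAKE_PAGES)))
```

Here the consumer can parse the first page while the producer is still waiting on the second. With more than one consumer process, the producer would need to put one `None` on the queue per consumer, and the collecting loop would need to wait for one sentinel from each.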
