Two Ways to Extract Page Links with Scrapy, and How to Construct an HtmlResponse Object

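A few notes on the Response object:

    A Response object describes an HTTP response. Response itself is only a base class; depending on the type of response content, it has the following subclasses:

        TextResponse, HtmlResponse, XmlResponse

    Taking HtmlResponse as the example: it inherits from TextResponse (as does XmlResponse), so on top of the base Response it gains text-oriented members such as text, css and xpath.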

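I. Using Selector

    Links are just another piece of data on the page, so they can be extracted the same way as any other data. When analyzing a page, you can build a Selector object in Jupyter Notebook and experiment with it (a Selector object provides the xpath and css methods).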

import requests

from scrapy.selector import Selector

res = requests.get("http://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html")

# res is a requests response, not a Scrapy one, so pass its HTML in through text=
selector = Selector(text=res.text)

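II. Using the linkextractors module of the Scrapy framework

    See the relevant references for usage.

  1. In le.extract_links(response), le is a LinkExtractor instance and response is a Scrapy HtmlResponse object.

  2. Constructing an HtmlResponse: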

from scrapy.http import HtmlResponse
from scrapy.linkextractors import LinkExtractor
import requests

# First fetch the page with requests, then build an HtmlResponse from the result so that LinkExtractor can be used

ResStack=requests.get("http://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html")

res = HtmlResponse(url="http://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html" , body=ResStack.text , encoding="utf-8")

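Notes: 1. HtmlResponse accepts many constructor parameters; consult a Scrapy reference book for how to use them.

    2. HtmlResponse also exposes the css and xpath methods and the text attribute, so it can be used for page analysis in Jupyter Notebook just like a Selector. It can additionally be passed to LinkExtractor to extract links, which makes it the more convenient choice.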
