Notes: scrapy selector

Date: 2023-03-10 03:32:54

scrapy version: 1.5.0

1. Overview

Scrapy's built-in selectors are built on top of lxml.

2. Usage

Both the xpath() and css() methods can be used for parsing, and both return a list (a SelectorList).

from scrapy.selector import Selector
sel = Selector(text=body).xpath('//div[@class="ip_list"]/text()').extract()

A selector also provides the re() method for regular-expression extraction; it is used much like the re library.

3. Classes and common attributes

Selector objects

class scrapy.selector.Selector(response=None, text=None, type=None)

response is an HtmlResponse or an XmlResponse object that will be used for selecting and extracting data.

text is a unicode string or utf-8 encoded text for cases when a response isn’t available. Using text and response together is undefined behavior.

type defines the selector type, it can be "html", "xml" or None (default).

If type is None, the selector automatically chooses the best type based on response type (see below), or defaults to "html" in case it is used together with text.

If type is None and a response is passed, the selector type is inferred from the response type as follows:

"html" for HtmlResponse type
"xml" for XmlResponse type
"html" for anything else

Otherwise, if type is set, the selector type will be forced and no detection will occur.

re(regex)

Apply the given regex and return a list of unicode strings with the matches.

regex can be either a compiled regular expression or a string which will be compiled to a regular expression using re.compile(regex)

extract()

Serialize and return the matched nodes as a list of unicode strings. Percent encoded content is unquoted.

remove_namespaces()

Remove all namespaces, allowing the document to be traversed using namespace-less xpaths.

SelectorList objects

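The SelectorList class is a subclass of the built-in list and can be thought of as a collection of Selector objects. Calling xpath(), css(), extract() or re() on a SelectorList calls that method on every element of the list and then concatenates the results into one flat list (note: each element's return value is not inserted as a single nested item).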