scrapy 的一个例子

时间:2023-03-09 21:56:50
scrapy 的一个例子

1、目标:

  scrapy 是一个爬虫构架,现用一个简单的例子来讲解,scrapy 的使用步骤

2、创建一个scrapy的项目:

  创建一个叫firstSpider的项目,命令如下:

scrapy startproject firstSpider 
[jianglexing@cstudio ~]$ scrapy startproject firstSpider
New Scrapy project 'firstSpider', using template directory '/usr/local/python-3.6.2/lib/python3.6/site-packages/scrapy/templates/project', created in:
/home/jianglexing/firstSpider You can start your first spider with:
cd firstSpider
scrapy genspider example example.com

  

3、创建一个项目时scrapy 命令干了一些什么:

  创建一个项目时scrapy 会创建一个目录,并向目录中增加若干文件

[jianglexing@cstudio ~]$ tree firstSpider/
firstSpider/
├── firstSpider
│ ├── __init__.py
│ ├── items.py
│ ├── middlewares.py
│ ├── pipelines.py
│ ├── __pycache__
│ ├── settings.py
│ └── spiders
│ ├── __init__.py
│ └── __pycache__
└── scrapy.cfg directories, files

4、进入项目所在的目录并创建爬虫:

[jianglexing@cstudio ~]$ cd firstSpider/
[jianglexing@cstudio firstSpider]$ scrapy genspider financeSpider www.financedatas.com
Created spider 'financeSpider' using template 'basic' in module:
firstSpider.spiders.financeSpider

5、一只爬虫在scrapy 项目中对应一个文件:

[jianglexing@cstudio firstSpider]$ tree ./
./
├── firstSpider
│ ├── __init__.py
│ ├── items.py
│ ├── middlewares.py
│ ├── pipelines.py
│ ├── __pycache__
│ │ ├── __init__.cpython-.pyc
│ │ └── settings.cpython-.pyc
│ ├── settings.py
│ └── spiders
│ ├── financeSpider.py # 这个就是刚才创建的爬虫文件
│ ├── __init__.py
│ └── __pycache__
│ └── __init__.cpython-.pyc
└── scrapy.cfg

6、编写爬虫的处理逻辑:

  以爬取 http://www.financedatas.com 网站首页的title为例

# -*- coding: utf-8 -*-
import scrapy class FinancespiderSpider(scrapy.Spider):
name = 'financeSpider'
allowed_domains = ['www.financedatas.com']
start_urls = ['http://www.financedatas.com/'] def parse(self, response):
"""在parse方法中编写处理逻辑"""
print('*'*64)
title=response.xpath('//title/text()').extract() #xpath 语法抽取数据
print(title)
print('*'*64)

7、运行爬虫,查看效果:

[jianglexing@cstudio spiders]$ scrapy crawl financeSpider
-- :: [scrapy.utils.log] INFO: Scrapy 1.4. started (bot: firstSpider)
-- :: [scrapy.utils.log] INFO: Overridden settings: {'BOT_NAME': 'firstSpider', 'NEWSPIDER_MODULE': 'firstSpider.spiders', 'ROBOTSTXT_OBEY': True, 'SPIDER_MODULES': ['firstSpider.spiders']}
.... ....
-- :: [scrapy.core.engine] DEBUG: Crawled () <GET http://www.financedatas.com/robots.txt> (referer: None)
-- :: [scrapy.core.engine] DEBUG: Crawled () <GET http://www.financedatas.com/> (referer: None)
****************************************************************
['欢迎来到 www.financedatas.com'
] # 这里就抽取到的数据
****************************************************************-- :: [scrapy.core.engine] INFO: Spider closed (finished)

----