Python: Simple HTML Page Parsing with urllib2, Part 1

Date: 2023-03-08 22:56:46

1. Fetching an HTML page with urllib2

#!/usr/bin/env python
# -*- coding:utf-8 -*-
import urllib2

# Fetch the page and print the raw HTML (Python 2)
response = urllib2.urlopen('http://www.baidu.com')
html = response.read()
print html

A few simple lines of code are enough to fetch the HTML page; what comes next is the work of parsing it.
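As a first pass at that parsing, here is a minimal sketch using the Python 2 standard library's HTMLParser to pull the page title out of the html variable from the snippet above (one option among many; the class and variable names are mine, not from the original post):

from HTMLParser import HTMLParser

class TitleParser(HTMLParser):
    # Collects the text inside the <title> tag
    def __init__(self):
        HTMLParser.__init__(self)
        self.in_title = False
        self.title = ''
    def handle_starttag(self, tag, attrs):
        if tag == 'title':
            self.in_title = True
    def handle_endtag(self, tag):
        if tag == 'title':
            self.in_title = False
    def handle_data(self, data):
        if self.in_title:
            self.title += data

parser = TitleParser()
parser.feed(html.decode('utf-8'))  # assuming the page is UTF-8 encoded
print parser.title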

That was the ideal; in practice, problems showed up immediately. Baidu does not block crawlers, so its page can be fetched normally, but a site like https://b.ishadow.tech/ does block them, and simply faking browser header fields in the request did not help either.
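For reference, that header spoofing attempt looked roughly like this (the User-Agent string below is an example value I am supplying, not the one from the original attempt):

import urllib2

req = urllib2.Request('https://b.ishadow.tech/', headers={
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_3) '
                  'AppleWebKit/537.36 (KHTML, like Gecko) '
                  'Chrome/56.0.2924.87 Safari/537.36',
})
try:
    # Print the first 200 bytes if the site answers
    print urllib2.urlopen(req, timeout=10).read()[:200]
except urllib2.URLError as e:
    # This site still refuses the request, matching the text above
    print 'blocked or timed out:', e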

I then thought I would try an existing crawler from GitHub, and found https://github.com/scrapy/scrapy, which looked pretty capable.

I installed it per the README with the command below, but the install failed, so I went back and read the documentation more carefully.

pip install scrapy

The official docs recommend installing it inside a Python virtual environment; the description reads roughly as follows (https://doc.scrapy.org/en/latest/intro/install.html#intro-using-virtualenv):

Using a virtual environment (recommended)
TL;DR: We recommend installing Scrapy inside a virtual environment on all platforms. Python packages can be installed either globally (a.k.a system wide), or in user-space. We do not recommend installing scrapy system wide. Instead, we recommend that you install scrapy within a so-called “virtual environment” (virtualenv). Virtualenvs allow you to not conflict with already-installed Python system packages (which could break some of your system tools and scripts), and still install packages normally with pip (without sudo and the likes).

So I decided to set up a Python virtual environment, starting with:

$ sudo pip install virtualenv

Check the basic usage:

$virtualenv -h
Usage: virtualenv [OPTIONS] DEST_DIR

Just pass virtualenv a destination directory.

So, create a new virtual environment:

$virtualenv e27
New python executable in ~/e27/bin/python
Installing setuptools, pip, wheel...done.

Activate the environment:

$source ./bin/activate

Note that once the switch succeeds, the prompt is prefixed with the environment name:

➜  e27 source ./bin/activate
(e27) ➜ e27

Exit the environment with:

$deactivate

Now for the real work: install the crawler (https://github.com/scrapy/scrapy) inside the virtual environment:

pip install scrapy

The install completed in about three minutes; oddly enough, installing straight into the global environment had actually failed earlier. On success the shell output ends with:

......
Successfully built lxml PyDispatcher Twisted pycparser
Installing collected packages: lxml, PyDispatcher, zope.interface, constantly, incremental, attrs, Automat, Twisted, ipaddress, asn1crypto, enum34, idna, pycparser, cffi, cryptography, pyOpenSSL, queuelib, w3lib, cssselect, parsel, pyasn1, pyasn1-modules, service-identity, scrapy
Successfully installed Automat-0.5. PyDispatcher-2.0. Twisted-17.1. asn1crypto-0.22. attrs-16.3. cffi-1.10. constantly-15.1. cryptography-1.8. cssselect-1.0. enum34-1.1. idna-2.5 incremental-16.10. ipaddress-1.0. lxml-3.7. parsel-1.1. pyOpenSSL-16.2. pyasn1-0.2. pyasn1-modules-0.0. pycparser-2.17 queuelib-1.4. scrapy-1.3. service-identity-16.0. w3lib-1.17. zope.interface-4.3.

With Scrapy installed, try it on a simple link:

(e27) ➜  e27 scrapy shell 'http://quotes.toscrape.com/page/1/'

which produces a pile of output like this:

[scrapy.core.engine] INFO: Spider opened
[scrapy.core.engine] DEBUG: Crawled () <GET http://quotes.toscrape.com/page/1/> (referer: None)
[s] Available Scrapy objects:
[s] scrapy scrapy module (contains scrapy.Request, scrapy.Selector, etc)
[s] crawler <scrapy.crawler.Crawler object at 0x1100bab50>
[s] item {}
[s] request <GET http://quotes.toscrape.com/page/1/>
[s] response < http://quotes.toscrape.com/page/1/>
[s] settings <scrapy.settings.Settings object at 0x1100baad0>
[s] spider <DefaultSpider 'default' at 0x11037ebd0>
[s] Useful shortcuts:
[s] fetch(url[, redirect=True]) Fetch URL and update local objects (by default, redirects are followed)
[s] fetch(req) Fetch a scrapy.Request and update local objects
[s] shelp() Shell help (print this help)
[s] view(response) View response in a browser
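While the shell is open you can already poke at the response object with Scrapy's selector API. A quick sketch (the CSS classes come from quotes.toscrape.com's markup, so treat the exact selectors as assumptions; the first call should print something like the title shown):

>>> # Page title text
>>> response.css('title::text').extract_first()
u'Quotes to Scrape'
>>> # Text of the first quote on the page
>>> response.css('div.quote span.text::text').extract_first()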

This proves the setup works. Next, try the link https://b.ishadow.tech/:

(e27) ➜  e27 scrapy shell 'https://b.ishadow.tech/'  

The result:

[scrapy.middleware] INFO: Enabled item pipelines:
[]
[scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:
[scrapy.core.engine] INFO: Spider opened
[scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://b.ishadow.tech/> (failed 1 times): TCP connection timed out: 60: Operation timed out.
[scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://b.ishadow.tech/> (failed 2 times): TCP connection timed out: 60: Operation timed out.
[scrapy.downloadermiddlewares.retry] DEBUG: Gave up retrying <GET https://b.ishadow.tech/> (failed 3 times): TCP connection timed out: 60: Operation timed out.
Traceback (most recent call last):

The crawl timed out, so it seems the request was identified as coming from a bot and rejected (the site opened fine in a browser at the time). Well played! By now you have probably guessed my real purpose; if not, open the link I was trying to crawl and you will see.
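For completeness: Scrapy's default User-Agent advertises itself as a bot (something like "Scrapy/1.3.0 (+http://scrapy.org)"), and any setting can be overridden from the shell with -s. A sketch worth trying, though as the urllib2 attempt above suggests, a browser-like User-Agent alone probably will not get this site to answer:

(e27) ➜  e27 scrapy shell -s USER_AGENT='Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_3)' 'https://b.ishadow.tech/'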