Python: Developing a Simple Crawler

Simple crawler architecture

Runtime workflow
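
The workflow can be sketched as a scheduler loop that coordinates a URL manager, a downloader, a parser, and an output step. This is a minimal sketch under stated assumptions, not the original course code: the function names (`crawl`, `fake_download`, `parse`) and the in-memory `FAKE_SITE` pages are hypothetical, and the downloader is stubbed so the example runs without network access.

```python
import re
from urllib.parse import urljoin

# Hypothetical in-memory "site" so the sketch runs offline.
FAKE_SITE = {
    "http://example.com/": '<a href="/a">A</a> <a href="/b">B</a>',
    "http://example.com/a": '<title>Page A</title> <a href="/b">B</a>',
    "http://example.com/b": '<title>Page B</title> <a href="/">home</a>',
}


def fake_download(url):
    """Stand-in for a real downloader: return the HTML, or None if unknown."""
    return FAKE_SITE.get(url)


def parse(page_url, html):
    """Extract (new_urls, data) from one page using simple regex matching."""
    hrefs = re.findall(r'href="([^"]+)"', html)
    new_urls = {urljoin(page_url, href) for href in hrefs}
    title = re.search(r"<title>(.*?)</title>", html)
    data = {"url": page_url, "title": title.group(1) if title else None}
    return new_urls, data


def crawl(root_url, max_pages=10):
    """Scheduler loop: take a URL, download it, parse it, queue new URLs."""
    new_urls = {root_url}   # URLs waiting to be crawled
    old_urls = set()        # URLs already crawled
    results = []            # collected data (the "output" step)
    while new_urls and len(old_urls) < max_pages:
        url = new_urls.pop()
        old_urls.add(url)
        html = fake_download(url)
        if html is None:
            continue
        found, data = parse(url, html)
        new_urls |= found - old_urls   # never re-queue a crawled URL
        results.append(data)
    return results
```

Calling `crawl("http://example.com/")` visits each of the three fake pages exactly once, even though the pages link back to each other.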

Role of the URL manager

Three ways to implement the URL manager
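
A URL manager typically tracks which URLs are waiting to be crawled and which have already been crawled, so that no page is fetched twice and the crawler does not loop. Storage options commonly used for this are in-memory sets, a relational database table, or a cache database such as Redis. The following is a minimal sketch of the in-memory variant only; the class name `UrlManager` and its method names are hypothetical.

```python
class UrlManager:
    """In-memory URL manager backed by two Python sets."""

    def __init__(self):
        self.new_urls = set()   # URLs waiting to be crawled
        self.old_urls = set()   # URLs already handed out for crawling

    def add_new_url(self, url):
        """Queue a URL unless it is empty or already known. Return True if added."""
        if not url:
            return False
        if url in self.new_urls or url in self.old_urls:
            return False
        self.new_urls.add(url)
        return True

    def add_new_urls(self, urls):
        """Queue several URLs at once; None or an empty iterable is ignored."""
        for url in urls or ():
            self.add_new_url(url)

    def has_new_url(self):
        """Return True while there is at least one URL left to crawl."""
        return bool(self.new_urls)

    def get_new_url(self):
        """Take one pending URL and mark it as crawled."""
        url = self.new_urls.pop()
        self.old_urls.add(url)
        return url
```

Sets make the duplicate check cheap; a database-backed variant would trade that simplicity for persistence across runs.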

Role of the web page downloader

Types of web page downloaders in Python

Three ways to download a web page with urllib2
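
`urllib2` exists only in Python 2; in Python 3 its functionality lives in `urllib.request`. A common breakdown of the three approaches is: call `urlopen` directly, build a `Request` object to attach headers or data, and build an opener with extra handlers (for example, cookie handling). The sketch below uses `urllib.request` and assumes that breakdown; the helper names `download`, `build_request`, and `build_cookie_opener` are hypothetical.

```python
import http.cookiejar
import urllib.request


def download(url, timeout=10):
    """Approach 1: call urlopen directly and return the body as bytes."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        if response.getcode() != 200:
            return None
        return response.read()


def build_request(url, user_agent="Mozilla/5.0"):
    """Approach 2: build a Request so headers (or POST data) can be attached."""
    request = urllib.request.Request(url)
    request.add_header("User-Agent", user_agent)
    return request


def build_cookie_opener():
    """Approach 3: build an opener with an extra handler, here for cookies."""
    cookie_jar = http.cookiejar.CookieJar()
    handler = urllib.request.HTTPCookieProcessor(cookie_jar)
    opener = urllib.request.build_opener(handler)
    return opener, cookie_jar
```

A request from `build_request` can be passed to `urllib.request.urlopen(request)` or to `opener.open(request)`; both need network access, so they are not called here.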

Role of the web page parser

Web page parsers available in Python
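
Parsers commonly used from Python include regular expressions (fuzzy text matching), the standard library's `html.parser`, and third-party libraries such as Beautiful Soup and lxml (structured parsing). To illustrate the difference between the two styles, here is a minimal sketch using only the standard library; the sample `HTML` string and the names `links_by_regex`, `LinkCollector`, and `links_by_parser` are hypothetical.

```python
import re
from html.parser import HTMLParser

HTML = (
    '<p>See <a href="http://example.com/elsie">Elsie</a> and '
    '<a href="http://example.com/lacie">Lacie</a>.</p>'
)


def links_by_regex(html):
    """Fuzzy matching: treat the page as plain text and search for a pattern."""
    return re.findall(r'<a\s+[^>]*href="([^"]*)"', html)


class LinkCollector(HTMLParser):
    """Structured parsing: react to each start tag the parser recognizes."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href":
                    self.links.append(value)


def links_by_parser(html):
    collector = LinkCollector()
    collector.feed(html)
    collector.close()
    return collector.links
```

Both functions return the same two URLs for this input, but the regular expression depends on the exact attribute quoting and order, while the parser-based version works on tags and attributes.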

Structured parsing relies on the DOM tree

Beautiful Soup syntax

Code examples:

1. Creating a Beautiful Soup object

 from bs4 import BeautifulSoup

 soup = BeautifulSoup(
     html_doc,               # the HTML document (str or bytes)
     'html.parser',          # the HTML parser to use
     from_encoding='utf-8'   # document encoding; only applies when html_doc is bytes
 )

2. Using the find_all and find methods
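
`find_all` returns every matching node as a list, while `find` returns only the first match, or `None` if nothing matches. Both accept a tag name plus filters on attributes and text. A minimal sketch, assuming a small hypothetical HTML fragment (the URLs and class names below are made up for illustration):

```python
import re
from bs4 import BeautifulSoup

html = """
<div>
  <a href="/view/123.htm" class="article_link">Python</a>
  <a href="/view/456.htm" class="article_link">Crawler</a>
  <a href="/about.htm" class="nav">About</a>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# Every <a> tag in the document.
all_links = soup.find_all("a")

# Filter by CSS class; the keyword is class_ because class is reserved in Python.
article_links = soup.find_all("a", class_="article_link")

# Filter by attribute value using a regular expression.
view_links = soup.find_all("a", href=re.compile(r"^/view/\d+\.htm$"))

# Filter by the tag's text; find returns the first match only.
python_link = soup.find("a", string="Python")

# find returns None when nothing matches.
missing = soup.find("a", href="/nope.htm")

# limit caps the number of results returned by find_all.
first_only = soup.find_all("a", limit=1)
```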

3. Accessing node information
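
Once a node has been found, its tag name, attributes, and text can be read from it. A minimal sketch on a single hypothetical `<a>` tag (the URL and class name are made up for illustration):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    '<a href="/view/123.htm" class="article_link">Python</a>',
    "html.parser",
)
node = soup.find("a")

tag_name = node.name             # name of the tag
href = node["href"]              # attribute access by key; raises KeyError if absent
css_classes = node["class"]      # class is multi-valued, so this is a list
title_attr = node.get("title")   # .get returns None for a missing attribute
text = node.get_text()           # text contained in the node
```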

4. Example: processing an HTML document with Beautiful Soup

 from bs4 import BeautifulSoup
 import re

 html_doc = """
 <html><head><title>The Dormouse's story</title></head>
 <body>
 <p class="title"><b>The Dormouse's story</b></p>
 <p class="story">Once upon a time there were three little sisters; and their names were
 <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
 <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
 <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
 and they lived at the bottom of a well.</p>
 <p class="story">...</p>
 """

 # html_doc is already a str, so from_encoding is not needed here
 soup = BeautifulSoup(
     html_doc,               # the HTML document string
     'html.parser'           # the HTML parser to use
 )

 print('Get all links')
 links = soup.find_all('a')
 for link in links:
     print(link.name, link['href'], link.get_text())

 print('Get the link for tillie')
 link_node = soup.find('a', href='http://example.com/tillie')
 print(link_node.name, link_node['href'], link_node.get_text())

 print('Match with a regular expression')
 link_node2 = soup.find('a', href=re.compile(r'lsi'))
 print(link_node2.name, link_node2['href'], link_node2.get_text())

 print('Get the text of a p element')
 p_node = soup.find('p', class_='title')
 print(p_node.name, p_node.get_text())

Console output:

 Get all links
 a http://example.com/elsie Elsie
 a http://example.com/lacie Lacie
 a http://example.com/tillie Tillie
 Get the link for tillie
 a http://example.com/tillie Tillie
 Match with a regular expression
 a http://example.com/elsie Elsie
 Get the text of a p element
 p The Dormouse's story

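More advanced crawlers also have to deal with situations such as login requirements, CAPTCHAs, Ajax-loaded content, server-side anti-crawler measures, multithreading, and distributed crawling.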
