Web Scraping Series: BeautifulSoup

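BeautifulSoup is a powerful library for web scraping. An HTML page is built from tags, and BeautifulSoup works at the tag level: it is a library for parsing, traversing, and maintaining the "tag tree".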

The basic elements of BeautifulSoup are as follows:

1. Basic usage:

 from bs4 import BeautifulSoup
 import requests

 url = "http://python123.io/ws/demo.html"
 r = requests.get(url)
 demo = r.text
 # Parse the fetched content: demo is the page text, and "html.parser" selects the parser that treats it as HTML
 soup = BeautifulSoup(demo, "html.parser")
 print(soup.prettify())  # print the parsed content, re-indented

The parsed output looks like this:

[Figure: output of soup.prettify()]
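To try the same steps without a network request, the page can also be parsed from an inline string. This is a minimal sketch, assuming the HTML printed for `soup` in the next section is a faithful copy of the demo page; `DEMO_HTML` is a name introduced here for illustration only.

```python
from bs4 import BeautifulSoup

# Inline copy of the demo page (assumed to match what requests would fetch).
DEMO_HTML = """<html><head><title>This is a python demo page</title></head>
<body>
<p class="title"><b>The demo python introduces several python courses.</b></p>
<p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a> and <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>.</p>
</body></html>"""

# BeautifulSoup accepts a plain string, so no HTTP request is needed here.
soup = BeautifulSoup(DEMO_HTML, "html.parser")

# prettify() returns a str in which the markup is re-indented by nesting depth.
pretty = soup.prettify()
print(pretty)
```

Working from a fixed string also makes the examples reproducible if the remote page changes or is unreachable.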

2. Detailed usage:

 >>> from bs4 import BeautifulSoup
 >>> import requests
 >>> url = "http://python123.io/ws/demo.html"
 >>> r = requests.get(url)
 >>> demo = r.text
 >>> soup = BeautifulSoup(demo, "html.parser")
 >>> soup.title  # the <title> tag
 <title>This is a python demo page</title>
 >>> soup.a  # the first <a> tag in the document
 <a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>
 >>> soup.a.name  # the name of the <a> tag
 'a'
 >>> soup.a.parent.name  # the name of the <a> tag's parent
 'p'
 >>> soup.a.parent.parent.name  # the name of the parent's parent
 'body'
 >>> soup.a.attrs  # the attributes of the <a> tag
 {'href': 'http://www.icourse163.org/course/BIT-268001', 'class': ['py1'], 'id': 'link1'}
 >>> soup.a.attrs["class"]  # attrs is a dict, so each attribute value can be looked up by key
 ['py1']
 >>> type(soup.a.attrs)  # confirm that attrs is a dict
 <class 'dict'>
 >>> soup.a.string  # the text inside the <a> tag
 'Basic Python'
 >>> soup  # the whole parsed document
 <html><head><title>This is a python demo page</title></head>
 <body>
 <p class="title"><b>The demo python introduces several python courses.</b></p>
 <p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
 <a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a> and <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>.</p>
 </body></html>
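The attribute lookups in this session can be combined into a small link extractor. This is a sketch under two assumptions: the HTML is an abbreviated inline stand-in for the demo page, and `find_all`, which this post has not introduced, is used to collect every `<a>` tag rather than only the first. `extract_links` is a hypothetical helper name.

```python
from bs4 import BeautifulSoup

# Abbreviated inline copy of the course paragraph (an assumption for offline use).
HTML = (
    '<p class="course">Courses: '
    '<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a> and '
    '<a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>.</p>'
)


def extract_links(soup):
    """Return (text, href, classes) for every <a> tag in the document."""
    links = []
    for a in soup.find_all("a"):
        # Tag.get() returns None instead of raising KeyError when the attribute is missing.
        href = a.get("href")
        # "class" is multi-valued in HTML, so BeautifulSoup stores it as a list.
        classes = a.get("class", [])
        links.append((a.string, href, classes))
    return links


soup = BeautifulSoup(HTML, "html.parser")
links = extract_links(soup)
for text, href, classes in links:
    print(text, href, classes)

# soup.<name> yields None when no such tag exists, so check before chaining.
print(soup.img is None)
# .string is None when a tag has more than one child, as this <p> does.
print(soup.p.string is None)
```

Using `get` rather than indexing `attrs` directly is a design choice for robustness: real pages often contain `<a>` tags without an `href`.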

3. Traversing the tag tree downward


 >>> soup.head  # the <head> tag
 <head><title>This is a python demo page</title></head>
 >>> soup.head.contents  # the child nodes of <head>
 [<title>This is a python demo page</title>]
 >>> soup.body.contents  # the child nodes of <body>
 ['\n', <p class="title"><b>The demo python introduces several python courses.</b></p>, '\n', <p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a> and <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>.</p>, '\n']
 >>> len(soup.body.contents)  # the number of child nodes, newline strings included
 5
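Because `.contents` also returns the newline strings between tags, it is often convenient to keep only the children that are tags. The sketch below uses `.children` and `.descendants`, two bs4 attributes that iterate over direct children and over all nested nodes respectively. The inline HTML is an abbreviated stand-in for the demo page, and `tag_children` is a hypothetical helper name.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Abbreviated inline copy of the demo page (an assumption for offline use).
HTML = """<html><head><title>This is a python demo page</title></head>
<body>
<p class="title"><b>The demo python introduces several python courses.</b></p>
<p class="course">Courses:
<a class="py1" id="link1">Basic Python</a> and <a class="py2" id="link2">Advanced Python</a>.</p>
</body></html>"""


def tag_children(node):
    """Return the direct children of node that are tags, skipping text nodes."""
    return [child for child in node.children if isinstance(child, Tag)]


soup = BeautifulSoup(HTML, "html.parser")

# .contents is a list and includes the newline strings around the <p> tags.
print(len(soup.body.contents))

# .children iterates over the same nodes; here only the tags are kept.
child_names = [child.name for child in tag_children(soup.body)]
print(child_names)

# .descendants walks the whole subtree in document order, not just one level.
descendant_names = [d.name for d in soup.body.descendants if isinstance(d, Tag)]
print(descendant_names)
```

Filtering with `isinstance(child, Tag)` avoids an `AttributeError`-style surprise when later code expects every child to have attributes such as `attrs`.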

4. Traversing the tag tree upward
