Web Scraping Series: BeautifulSoup

Date: 2023-12-31 19:43:26

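BeautifulSoup is a powerful tool for web scraping. An HTML page is made up of tags, and BeautifulSoup works at the level of those tags: it is a library for parsing, traversing, and maintaining the "tag tree".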

The basic elements of BeautifulSoup are the Tag, its Name, its Attributes, and the NavigableString and Comment text nodes inside it.
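As a rough illustration of these basic elements, the sketch below parses a one-line snippet and inspects each kind of object. The markup is a hypothetical example, not the demo page used in the rest of this post.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString, Tag

# Hypothetical one-line document used only for this illustration
html = '<p class="title"><b>Hello</b><!-- a comment --></p>'
soup = BeautifulSoup(html, "html.parser")

tag = soup.p                        # Tag: the first <p> element
print(isinstance(tag, Tag))
print(tag.name)                     # Name: the tag's name
print(tag.attrs)                    # Attributes: a dict of attribute values

text = tag.b.string                 # NavigableString: the text inside <b>
print(isinstance(text, NavigableString), text)

comment = tag.contents[1]           # Comment: a special kind of NavigableString
print(isinstance(comment, Comment), repr(comment))
```

Because Comment is a subclass of NavigableString, checking the type is the way to tell a comment apart from ordinary text.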

1. The basic pattern is as follows:

from bs4 import BeautifulSoup
import requests

url = "http://python123.io/ws/demo.html"
r = requests.get(url)
demo = r.text
# Parse the fetched content: demo is the page text, and "html.parser" selects Python's built-in HTML parser
soup = BeautifulSoup(demo, "html.parser")
# Print the parsed document, indented by nesting depth
print(soup.prettify())

The call to prettify() prints the parsed document with each tag and text node on its own line, indented by nesting depth.
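To see this kind of output without a network request, here is a minimal sketch that parses a short inline snippet and prints its prettified form. The markup is a hypothetical stand-in for a downloaded page.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for a downloaded page
html = "<html><head><title>Demo</title></head><body><p>Hello</p></body></html>"
soup = BeautifulSoup(html, "html.parser")

# prettify() returns a string with one tag or text node per line
pretty = soup.prettify()
print(pretty)
```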

2. Typical usage is as follows:

>>> from bs4 import BeautifulSoup
>>> import requests
>>> url = "http://python123.io/ws/demo.html"
>>> r = requests.get(url)
>>> demo = r.text
>>> soup = BeautifulSoup(demo, "html.parser")
>>> soup.title  # the page title
<title>This is a python demo page</title>
>>> soup.a  # the first a tag
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>
>>> soup.a.name  # the name of the a tag
'a'
>>> soup.a.parent.name  # the name of the a tag's parent
'p'
>>> soup.a.parent.parent.name  # the name of the parent's parent
'body'
>>> soup.a.attrs  # the attributes of the a tag
{'href': 'http://www.icourse163.org/course/BIT-268001', 'class': ['py1'], 'id': 'link1'}
>>> soup.a.attrs["class"]  # attrs is a dict, so each attribute value can be read by key
['py1']
>>> type(soup.a.attrs)  # attrs is a dict
<class 'dict'>
>>> soup.a.string  # the text inside the a tag
'Basic Python'
>>> soup  # the full contents of soup
<html><head><title>This is a python demo page</title></head>
<body>
<p class="title"><b>The demo python introduces several python courses.</b></p>
<p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses: <a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a> and <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>.</p>
</body></html>
>>>
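The session above reads attributes through .attrs. As a small complementary sketch, a tag also supports dictionary-style access directly, and get() avoids a KeyError when an attribute is missing. The markup here is a trimmed inline copy of the first link on the demo page, so the snippet runs offline.

```python
from bs4 import BeautifulSoup

# Trimmed inline copy of the first link on the demo page (an assumption for offline use)
html = (
    '<p class="course">'
    '<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>'
    '</p>'
)
soup = BeautifulSoup(html, "html.parser")
link = soup.a

href = link["href"]         # same value as link.attrs["href"]; raises KeyError if absent
title = link.get("title")   # None, because the tag has no title attribute
text = link.get_text()      # the tag's text as a plain str
print(href, title, text)
```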

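3. Downward traversal of the tag tree

Downward traversal moves from a tag to the nodes beneath it. The .contents attribute returns a tag's direct children as a list, text nodes such as '\n' included.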

>>> soup.head  # the head of soup
<head><title>This is a python demo page</title></head>
>>> soup.head.contents  # the child nodes of head
[<title>This is a python demo page</title>]
>>> soup.body.contents  # the child nodes of body
['\n', <p class="title"><b>The demo python introduces several python courses.</b></p>, '\n', <p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a> and <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>.</p>, '\n']
>>>
>>> len(soup.body.contents)  # the number of child nodes of body
5
>>>
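Besides .contents, a tag can be walked downward with the .children and .descendants iterators. The following is a minimal sketch on a trimmed inline copy of the demo page; the whitespace between tags has been removed, so the child counts differ from the session above.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Trimmed inline copy of the demo page (an assumption, so the sketch runs offline)
html = (
    '<html><head><title>This is a python demo page</title></head><body>'
    '<p class="title"><b>The demo python introduces several python courses.</b></p>'
    '<p class="course">Courses: '
    '<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a> and '
    '<a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>.</p>'
    '</body></html>'
)
soup = BeautifulSoup(html, "html.parser")

# .children iterates over direct children only (the same nodes as .contents)
child_names = [child.name for child in soup.body.children if isinstance(child, Tag)]
print(child_names)

# .descendants walks every node below the tag, depth first; text nodes are filtered out here
descendant_names = [node.name for node in soup.body.descendants if isinstance(node, Tag)]
print(descendant_names)
```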

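4. Upward traversal of the tag tree

Upward traversal moves from a tag toward the root of the tree: .parent returns the enclosing tag, and .parents iterates over all ancestors in turn.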
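A minimal sketch of upward traversal, assuming a trimmed inline copy of the demo page: it starts at the first a tag and collects the name of every ancestor. The last ancestor is the BeautifulSoup object itself, whose name is '[document]'.

```python
from bs4 import BeautifulSoup

# Trimmed inline copy of the demo page (an assumption, so the sketch runs offline)
html = (
    '<html><head><title>This is a python demo page</title></head><body>'
    '<p class="course">Courses: '
    '<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>'
    '</p></body></html>'
)
soup = BeautifulSoup(html, "html.parser")

first_link = soup.a
print(first_link.parent.name)   # the tag that directly encloses the link

# .parents yields each ancestor in turn, ending with the BeautifulSoup object
ancestors = [parent.name for parent in first_link.parents]
print(ancestors)
```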

5. Parallel traversal of the tag tree

Parallel traversal takes place among nodes that share the same parent.
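As a sketch of parallel traversal, the snippet below steps between the siblings of the first a tag using .next_sibling, .previous_sibling, and the .next_siblings iterator. It assumes a trimmed inline copy of the demo page's second paragraph. Note that siblings include text nodes, not only tags.

```python
from bs4 import BeautifulSoup

# Trimmed inline copy of the demo page's second paragraph (an assumption for offline use)
html = (
    '<p class="course">Courses: '
    '<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a> and '
    '<a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>.</p>'
)
soup = BeautifulSoup(html, "html.parser")

first_link = soup.a
after = first_link.next_sibling         # the text node between the two links
second_link = after.next_sibling        # the second a tag
before = first_link.previous_sibling    # the text node in front of the first link
print(repr(after), second_link["id"], repr(before))

# .next_siblings iterates over every later node under the same parent
later = list(first_link.next_siblings)
print(len(later))
```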

6. The find_all() method

find_all() searches the tag tree and returns a list of all matches. The cases below use the soup object from section 2.

1) Find all a tags in soup
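A minimal sketch of this case, assuming a trimmed inline copy of the demo page so that it runs without a network request:

```python
from bs4 import BeautifulSoup

# Trimmed inline copy of the demo page (an assumption, so the sketch runs offline)
html = (
    '<html><head><title>This is a python demo page</title></head><body>'
    '<p class="title"><b>The demo python introduces several python courses.</b></p>'
    '<p class="course">Courses: '
    '<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a> and '
    '<a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>.</p>'
    '</body></html>'
)
soup = BeautifulSoup(html, "html.parser")

# Passing a tag name returns every tag with that name, in document order
links = soup.find_all("a")
for link in links:
    print(link.get("href"))
```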

2) Find all a tags and b tags in soup at the same time
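Passing a list of names matches any of them. The sketch below assumes a trimmed inline copy of the demo page.

```python
from bs4 import BeautifulSoup

# Trimmed inline copy of the demo page (an assumption, so the sketch runs offline)
html = (
    '<html><head><title>This is a python demo page</title></head><body>'
    '<p class="title"><b>The demo python introduces several python courses.</b></p>'
    '<p class="course">Courses: '
    '<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a> and '
    '<a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>.</p>'
    '</body></html>'
)
soup = BeautifulSoup(html, "html.parser")

# A list of names matches tags with any of those names, in document order
matches = soup.find_all(["a", "b"])
names = [tag.name for tag in matches]
print(names)
```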

3) The recursive parameter: search all descendants (the default) or only direct children
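The following sketch contrasts the default recursive search with recursive=False, which looks only at direct children. It assumes a trimmed inline copy of the demo page.

```python
from bs4 import BeautifulSoup

# Trimmed inline copy of the demo page (an assumption, so the sketch runs offline)
html = (
    '<html><head><title>This is a python demo page</title></head><body>'
    '<p class="title"><b>The demo python introduces several python courses.</b></p>'
    '<p class="course">Courses: '
    '<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a> and '
    '<a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>.</p>'
    '</body></html>'
)
soup = BeautifulSoup(html, "html.parser")

# Default (recursive=True): every descendant is searched
all_links = soup.find_all("a")

# recursive=False: only direct children are searched; the a tags sit deeper, so none match
direct_links = soup.find_all("a", recursive=False)

# The two p tags are direct children of body, so they are found either way
direct_paragraphs = soup.body.find_all("p", recursive=False)
print(len(all_links), len(direct_links), len(direct_paragraphs))
```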

4) Use a regular expression to find tags whose names start with b
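A compiled pattern can be passed in place of a tag name. In this sketch, which assumes a trimmed inline copy of the demo page, the pattern ^b matches both body and b.

```python
import re

from bs4 import BeautifulSoup

# Trimmed inline copy of the demo page (an assumption, so the sketch runs offline)
html = (
    '<html><head><title>This is a python demo page</title></head><body>'
    '<p class="title"><b>The demo python introduces several python courses.</b></p>'
    '<p class="course">Courses: '
    '<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>.</p>'
    '</body></html>'
)
soup = BeautifulSoup(html, "html.parser")

# Match every tag whose name begins with the letter b
b_tags = soup.find_all(re.compile("^b"))
b_names = [tag.name for tag in b_tags]
print(b_names)
```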

5) Use the name and attrs parameters to find tags whose attributes contain a specified string
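A minimal sketch of filtering by attributes, assuming a trimmed inline copy of the demo page. A plain string in the second position is treated as a CSS class, keyword arguments filter on the named attribute, and a compiled pattern allows partial matches.

```python
import re

from bs4 import BeautifulSoup

# Trimmed inline copy of the demo page (an assumption, so the sketch runs offline)
html = (
    '<html><head><title>This is a python demo page</title></head><body>'
    '<p class="title"><b>The demo python introduces several python courses.</b></p>'
    '<p class="course">Courses: '
    '<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a> and '
    '<a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>.</p>'
    '</body></html>'
)
soup = BeautifulSoup(html, "html.parser")

# p tags whose class is "course"
course_paragraphs = soup.find_all("p", "course")

# Any tag whose id is exactly "link1"
exact_id = soup.find_all(id="link1")

# Any tag whose id contains "link"
partial_id = soup.find_all(id=re.compile("link"))

# The same kind of filter written as an explicit attrs dict
py2_links = soup.find_all("a", attrs={"class": "py2"})
print(len(course_paragraphs), len(exact_id), len(partial_id), len(py2_links))
```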

6) Use the string parameter to search for a specified string
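With the string parameter, find_all() returns matching text nodes rather than tags. The sketch below assumes a trimmed inline copy of the demo page and shows an exact match next to a regular-expression match.

```python
import re

from bs4 import BeautifulSoup

# Trimmed inline copy of the demo page (an assumption, so the sketch runs offline)
html = (
    '<html><head><title>This is a python demo page</title></head><body>'
    '<p class="title"><b>The demo python introduces several python courses.</b></p>'
    '<p class="course">Courses: '
    '<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a> and '
    '<a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>.</p>'
    '</body></html>'
)
soup = BeautifulSoup(html, "html.parser")

# Exact match: the whole text node must equal the given string
exact = soup.find_all(string="Basic Python")

# Pattern match: any text node containing lowercase "python"
partial = soup.find_all(string=re.compile("python"))
print(exact)
print(partial)
```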