beautiful soup库的安装
1
|
pip install beautifulsoup4
|
beautiful soup库的理解
beautiful soup库是解析、遍历、维护“标签树”的功能库
beautiful soup库的引用
1
2
|
from bs4 import BeautifulSoup
import bs4
|
BeautifulSoup类
BeautifulSoup对应一个HTML/XML文档的全部内容
回顾demo.html
1
2
3
4
5
|
import requests
r = requests.get( "http://python123.io/ws/demo.html" )
demo = r.text
print (demo)
|
1
2
3
4
5
6
|
< html >< head >< title >This is a python demo page</ title ></ head >
< body >
< p class = "title" >< b >The demo python introduces several python courses.</ b ></ p >
< p class = "course" >Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
< a href = "http://www.icourse163.org/course/BIT-268001" class = "py1" id = "link1" >Basic Python</ a > and < a href = "http://www.icourse163.org/course/BIT-1001870001" class = "py2" id = "link2" >Advanced Python</ a >.</ p >
</ body ></ html >
|
Tag标签
基本元素 | 说明 |
---|---|
Tag | 标签,最基本的信息组织单元,分别用<>和</>标明开头和结尾 |
1
2
3
4
5
6
7
8
|
import requests
from bs4 import BeautifulSoup
r = requests.get( "http://python123.io/ws/demo.html" )
demo = r.text
soup = BeautifulSoup(demo, "html.parser" )
print (soup.title)
tag = soup.a
print (tag)
|
1
2
|
< title >This is a python demo page</ title >
< a href = "http://www.icourse163.org/course/BIT-268001" >Basic Python</ a >
|
任何存在于HTML语法中的标签都可以用soup.访问获得。当HTML文档中存在多个相同对应内容时,soup.返回第一个
Tag的name
基本元素 | 说明 |
---|---|
Name |
标签的名字,
… 的名字是'p',格式:.name |
1
2
3
4
5
6
7
8
|
import requests
from bs4 import BeautifulSoup
r = requests.get( "http://python123.io/ws/demo.html" )
demo = r.text
soup = BeautifulSoup(demo, "html.parser" )
print (soup.a.name)
print (soup.a.parent.name)
print (soup.a.parent.parent.name)
|
1
2
3
|
a
p
body
|
Tag的attrs(属性)
基本元素 | 说明 |
---|---|
Attributes | 标签的属性,字典形式组织,格式:.attrs |
1
2
3
4
5
6
7
8
9
10
11
|
import requests
from bs4 import BeautifulSoup
r = requests.get( "http://python123.io/ws/demo.html" )
demo = r.text
soup = BeautifulSoup(demo, "html.parser" )
tag = soup.a
print (tag.attrs)
print (tag.attrs[ 'class' ])
print (tag.attrs[ 'href' ])
print ( type (tag.attrs))
print ( type (tag))
|
1
2
3
4
5
|
{ 'href' : 'http://www.icourse163.org/course/BIT-268001' , 'class' : [ 'py1' ], 'id' : 'link1' }
[ 'py1' ]
http: / / www.icourse163.org / course / BIT - 268001
< class 'dict' >
< class 'bs4.element.Tag' >
|
Tag的NavigableString
Tag的NavigableString
基本元素 | 说明 |
---|---|
NavigableString | 标签内非属性字符串,<>…</>中字符串,格式:.string |
Tag的Comment
基本元素 | 说明 |
---|---|
Comment | 标签内字符串的注释部分,一种特殊的Comment类型 |
1
2
3
4
5
6
7
|
import requests
from bs4 import BeautifulSoup
newsoup = BeautifulSoup( "<b><!--This is a comment--></b><p>This is not a comment</p>" , "html.parser" )
print (newsoup.b.string)
print ( type (newsoup.b.string))
print (newsoup.p.string)
print ( type (newsoup.p.string))
|
1
2
3
4
|
This is a comment
< class 'bs4.element.Comment' >
This is not a comment
< class 'bs4.element.NavigableString' >
|
HTML基本格式
标签树的下行遍历
属性 | 说明 |
---|---|
.contents | 子节点的列表,将所有儿子结点存入列表 |
.children | 子节点的迭代类型,与.contents类似,用于循环遍历儿子结点 |
.descendents | 子孙节点的迭代类型,包含所有子孙节点,用于循环遍历 |
BeautifulSoup类型是标签树的根节点
1
2
3
4
5
6
7
8
9
10
11
|
import requests
from bs4 import BeautifulSoup
r = requests.get( "http://python123.io/ws/demo.html" )
demo = r.text
soup = BeautifulSoup(demo, "html.parser" )
print (soup.head)
print (soup.head.contents)
print (soup.body.contents)
print ( len (soup.body.contents))
print (soup.body.contents[ 1 ])
|
1
2
3
4
5
6
7
8
|
<head><title>This is a python demo page< / title>< / head>
[<title>This is a python demo page< / title>]
[ '\n' , <p ><b>The demo python introduces several python courses.< / b>< / p>, '\n' , <p >Python
is a wonderful general - purpose programming language. You can learn Python from novice to professional by tracking the
following courses:
<a href = "http://www.icourse163.org/course/BIT-268001" >Basic Python< / a> and <a href = "http://www.icourse163.org/course/BIT-1001870001" >Advanced Python< / a>.< / p>, '\n' ]
5
<p ><b>The demo python introduces several python courses.< / b>< / p>
|
1
2
3
4
|
for child in soup.body.children:
print (child) #遍历儿子结点
for child in soup.body.descendants:
print (child) #遍历子孙节点
|
标签树的上行遍历
属性 | 说明 |
---|---|
.parent | 节点的父亲标签 |
.parents | 节点先辈标签的迭代类型,用于循环遍历先辈节点 |
1
2
3
4
5
6
7
8
|
import requests
from bs4 import BeautifulSoup
r = requests.get( "http://python123.io/ws/demo.html" )
demo = r.text
soup = BeautifulSoup(demo, "html.parser" )
print (soup.title.parent)
print (soup.html.parent)
|
1
2
3
4
5
6
7
|
< head >< title >This is a python demo page</ title ></ head >
< html >< head >< title >This is a python demo page</ title ></ head >
< body >
< p >< b >The demo python introduces several python courses.</ b ></ p >
< p >Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
< a href = "http://www.icourse163.org/course/BIT-268001" >Basic Python</ a > and < a href = "http://www.icourse163.org/course/BIT-1001870001" >Advanced Python</ a >.</ p >
</ body ></ html >
|
1
2
3
4
5
6
7
8
9
10
11
|
import requests
from bs4 import BeautifulSoup
r = requests.get( "http://python123.io/ws/demo.html" )
demo = r.text
soup = BeautifulSoup(demo, "html.parser" )
for parent in soup.a.parents:
if parent is None :
print (parent)
else :
print (parent.name)
|
1
2
3
4
|
p
body
html
[document]
|
标签的平行遍历
属性 | 说明 |
---|---|
.next_sibling | 返回按照HTML文本顺序的下一个平行节点标签 |
.previous.sibling | 返回按照HTML文本顺序的上一个平行节点标签 |
.next_siblings | 迭代类型,返回按照HTML文本顺序的后续所有平行节点标签 |
.previous.siblings | 迭代类型,返回按照HTML文本顺序的前续所有平行节点标签 |
1
2
3
4
5
6
7
8
9
10
11
12
13
|
import requests
from bs4 import BeautifulSoup
r = requests.get( "http://python123.io/ws/demo.html" )
demo = r.text
soup = BeautifulSoup(demo, "html.parser" )
print (soup.a.next_sibling)
print (soup.a.next_sibling.next_sibling)
print (soup.a.previous_sibling)
print (soup.a.previous_sibling.previous_sibling)
print (soup.a.parent)
|
1
2
3
4
5
6
7
|
and
<a class = "py2" href = "http://www.icourse163.org/course/BIT-1001870001" id = "link2" >Advanced Python< / a>
Python is a wonderful general - purpose programming language. You can learn Python from novice to professional by tracking the following courses:
None
<p class = "course" >Python is a wonderful general - purpose programming language. You can learn Python from novice to professional by tracking the following courses:
<a class = "py1" href = "http://www.icourse163.org/course/BIT-268001" id = "link1" >Basic Python< / a> and <a class = "py2" href = "http://www.icourse163.org/course/BIT-1001870001" id = "link2" >Advanced Python< / a>.< / p>
|
1
2
3
4
|
for sibling in soup.a.next_sibling:
print (sibling) #遍历后续节点
for sibling in soup.a.previous_sibling:
print (sibling) #遍历前续节点
|
bs库的prettify()方法
1
2
3
4
5
6
7
|
import requests
from bs4 import BeautifulSoup
r = requests.get( "http://python123.io/ws/demo.html" )
demo = r.text
soup = BeautifulSoup(demo, "html.parser" )
print (soup.prettify())
|
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
|
< html >
< head >
< title >
This is a python demo page
</ title >
</ head >
< body >
< p class = "title" >
< b >
The demo python introduces several python courses.
</ b >
</ p >
< p class = "course" >
Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
Basic Python
</ a >
and
< a class = "py2" href = "http://www.icourse163.org/course/BIT-1001870001" id = "link2" >
Advanced Python
</ a >
.
</ p >
</ body >
</ html >
|
.prettify()为HTML文本<>及其内容增加更加'\n'
.prettify()可用于标签,方法:.prettify()
bs4库的编码
bs4库将任何HTML输入都变成utf-8编码
python 3.x默认支持编码是utf-8,解析无障碍
1
2
3
4
5
6
7
|
import requests
from bs4 import BeautifulSoup
soup = BeautifulSoup( "<p>中文</p>" , "html.parser" )
print (soup.p.string)
print (soup.p.prettify())
|
1
2
3
4
5
|
中文
<p>
中文
< / p>
|
到此这篇关于python beautiful soup库入门安装教程的文章就介绍到这了,更多相关python beautiful soup库入门内容请搜索服务器之家以前的文章或继续浏览下面的相关文章希望大家以后多多支持服务器之家!
原文链接:https://blog.****.net/weixin_46530492/article/details/119960182