1. Installation
- Beautiful Soup is a third-party library, so it must be installed separately; installation is straightforward.
- Because BS4 relies on a document parser when parsing a page, you should also install lxml as the parsing library.
- Python ships with its own parser, html.parser, but it is somewhat slower than lxml.
```shell
pip install bs4
pip install lxml
pip install html5lib
```
2. Parsing with html.parser
- html.parser names the parser used when parsing the document.
- The parser can also be lxml or html5lib.
```python
html = '''
<div class="modal-dialog">
  <div class="modal-content">
    <div class="modal-header">
      <button type="button" class="close" data-dismiss="modal">×</button>
      <h4 class="modal-title">Modal title</h4>
    </div>
    <div class="modal-body">
      ...
    </div>
    <div class="modal-footer">
      <a href="#" rel="external nofollow" class="btn btn-default" data-dismiss="modal">Close</a>
      <a href="#" rel="external nofollow" class="btn btn-primary">Save</a>
    </div>
  </div>
</div>
'''
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'html.parser')
# prettify() pretty-prints the parsed HTML/XML document
print(soup.prettify())
```
3. Parsing an external document
- An external document can also be opened and read with open().
```python
from bs4 import BeautifulSoup

fp = open('html_doc.html', encoding='utf8')
soup = BeautifulSoup(fp, 'lxml')
```
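In practice you may prefer a with block so the file handle is closed automatically. A minimal sketch (the file name html_doc.html and its contents are placeholders created by the example itself; html.parser is used here so no extra parser needs to be installed):

```python
from bs4 import BeautifulSoup

# Create a small sample file so the example is self-contained
with open('html_doc.html', 'w', encoding='utf8') as fp:
    fp.write('<html><head><title>demo</title></head>'
             '<body><p>hello</p></body></html>')

# The with block closes the file handle automatically on exit
with open('html_doc.html', encoding='utf8') as fp:
    soup = BeautifulSoup(fp, 'html.parser')

print(soup.title.text)   # demo
```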
4. Tag selectors
- Tags are the basic building blocks of an HTML document.
- Using tag names and tag attributes, you can extract the content you want.
```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<p class="name nickname user"><b>i am autofelix</b></p>', 'html.parser')

# Get the HTML of the whole <p> tag
print(soup.p)
# Get the <b> tag
print(soup.p.b)
# Get the text of the <p> tag, via the NavigableString accessors string, text, or get_text()
print(soup.p.text)
# Returns a dict of all attributes and their values
print(soup.p.attrs)
# Check the type of the returned object
print(type(soup.p))
# Get an attribute's value by name; for class, the value is a list
print(soup.p['class'])
# Assign a new value to the class attribute; the list is rendered back into the tag
soup.p['class'] = ['Web', 'Site']
print(soup.p)
```
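The three text accessors are not interchangeable: string returns None when a tag has more than one child, while text and get_text() concatenate all descendant strings. A small sketch of the difference:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<p>one <b>two</b></p>', 'html.parser')

# .string is None: <p> has two children, a text node and a <b> tag
print(soup.p.string)    # None
# .text (and .get_text()) joins all descendant strings
print(soup.p.text)      # one two
# With a single child, .string works as expected
print(soup.b.string)    # two
```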
5. CSS selectors
- Most CSS selectors are supported, including the common tag, class, and id selectors, as well as hierarchy selectors.
- Passing a selector to the select() method searches the HTML document for the matching content.
```python
html = """
<html>
<head>
  <title>零基础学编程</title>
</head>
<body>
  <p class="intro"><b>i am autofelix</b></p>
  <p class="nickname">飞兔小哥</p>
  <a href="https://autofelix.blog.csdn.net" rel="external nofollow" id="csdn">csdn主页</a>
  <a href="https://xie.infoq.cn/u/autofelix/publish" rel="external nofollow" id="infoq">infoq主页</a>
  <a href="https://blog.51cto.com/autofelix" rel="external nofollow" id="51cto">51cto主页</a>
  <p class="attention">跪求关注 一键三连</p>
  <p class="introduce">
    <a href="https://www.cnblogs.com/autofelix" rel="external nofollow" id="cnblogs">博客园主页</a>
  </p>
</body>
</html>
"""
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'html.parser')

# Find by tag name
print(soup.select('p'))
# Find by attribute selector
print(soup.select('a[href]'))
# Find by class
print(soup.select('.attention'))
# Find descendant nodes
print(soup.select('html head title'))
# Find sibling nodes
print(soup.select('p + a'))
# Select a sibling of <p> by id
print(soup.select('p ~ #csdn'))
# nth-of-type(n) matches the nth sibling of the same type
print(soup.select('p ~ a:nth-of-type(1)'))
# Find child nodes
print(soup.select('p > a'))
print(soup.select('.introduce > #cnblogs'))
```
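When you only need the first match, select_one() returns a single tag (or None) rather than a list. A minimal sketch:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<a id="csdn">csdn</a><a id="infoq">infoq</a>',
                     'html.parser')

# select() always returns a list, even when an id is unique
print(soup.select('#csdn'))         # [<a id="csdn">csdn</a>]
# select_one() returns the first matching tag...
print(soup.select_one('a')['id'])   # csdn
# ...or None when nothing matches
print(soup.select_one('.missing'))  # None
```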
6. Traversing nodes
- contents and children traverse child nodes.
- parent and parents traverse parent nodes.
- next_sibling and previous_sibling traverse sibling nodes.
```python
html = """
<html>
<head>
  <title>零基础学编程</title>
</head>
<body>
  <p class="intro"><b>i am autofelix</b></p>
  <p class="nickname">飞兔小哥</p>
  <a href="https://autofelix.blog.csdn.net" rel="external nofollow" id="csdn">csdn主页</a>
  <a href="https://xie.infoq.cn/u/autofelix/publish" rel="external nofollow" id="infoq">infoq主页</a>
  <a href="https://blog.51cto.com/autofelix" rel="external nofollow" id="51cto">51cto主页</a>
  <p class="attention">跪求关注 一键三连</p>
  <p class="introduce">
    <a href="https://www.cnblogs.com/autofelix" rel="external nofollow" id="cnblogs">博客园主页</a>
  </p>
</body>
</html>
"""
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'html.parser')

body_tag = soup.body
print(body_tag)

# contents returns all child nodes as a list
print(body_tag.contents)

# children iterates over the child nodes
for child in body_tag.children:
    print(child)
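The bullets above also mention parent and sibling traversal, which the listing does not demonstrate. A minimal sketch (note that whitespace between tags counts as a text-node sibling, so this example uses markup without gaps):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<div><p id="a">one</p><p id="b">two</p></div>',
                     'html.parser')
p = soup.find('p', id='a')

# parent walks one level up; parents yields every ancestor
print(p.parent.name)                 # div
print([t.name for t in p.parents])   # ['div', '[document]']

# next_sibling / previous_sibling move between adjacent nodes
print(p.next_sibling['id'])                           # b
print(soup.find('p', id='b').previous_sibling['id'])  # a
```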
7. The find_all method
- find_all() is one of the most commonly used methods for parsing HTML documents.
- It searches all descendants of the current tag, checks whether each node matches the filter conditions, and returns the matching content as a list.
```python
html = """
<html>
<head>
  <title>零基础学编程</title>
</head>
<body>
  <p class="intro"><b>i am autofelix</b></p>
  <p class="nickname">飞兔小哥</p>
  <a href="https://autofelix.blog.csdn.net" rel="external nofollow" id="csdn">csdn主页</a>
  <a href="https://xie.infoq.cn/u/autofelix/publish" rel="external nofollow" id="infoq">infoq主页</a>
  <a href="https://blog.51cto.com/autofelix" rel="external nofollow" id="51cto">51cto主页</a>
  <p class="attention">跪求关注 一键三连</p>
  <p class="introduce">
    <a href="https://www.cnblogs.com/autofelix" rel="external nofollow" id="cnblogs">博客园主页</a>
  </p>
</body>
</html>
"""
import re
from bs4 import BeautifulSoup

# Create the soup object
soup = BeautifulSoup(html, 'html.parser')

# Find and return all <a> tags
print(soup.find_all("a"))
# Return only the first two <a> tags
print(soup.find_all("a", limit=2))
# Find by tag attribute and attribute value
print(soup.find_all("p", class_="nickname"))
print(soup.find_all(id="infoq"))
# Find tags given a list of tag names
print(soup.find_all(['b', 'a']))
# Match id attribute values with a regular expression
print(soup.find_all('a', id=re.compile(r'.\d')))
print(soup.find_all(id=True))

# True matches anything: find every tag and print its name
for tag in soup.find_all(True):
    print(tag.name, end=" ")

# Print all tags whose name starts with 'b'
for tag in soup.find_all(re.compile("^b")):
    print(tag.name)

# Full form
soup.find_all("a")
# Shorthand: calling the soup object directly is equivalent
soup("a")
```
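Besides strings, lists, regular expressions, and True, find_all() also accepts a function that receives each tag and returns a boolean, which is useful for conditions the other filters cannot express. A minimal sketch:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<p class="intro">a</p><p>b</p><a href="#">c</a>',
                     'html.parser')

# A filter function: matches tags that have a class attribute
def has_class(tag):
    return tag.has_attr('class')

print(soup.find_all(has_class))   # [<p class="intro">a</p>]

# Lambdas work too: <p> tags WITHOUT a class attribute
print(soup.find_all(lambda t: t.name == 'p' and not t.has_attr('class')))
```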
8. The find method
```python
html = """
<html>
<head>
  <title>零基础学编程</title>
</head>
<body>
  <p class="intro"><b>i am autofelix</b></p>
  <p class="nickname">飞兔小哥</p>
  <a href="https://autofelix.blog.csdn.net" rel="external nofollow" id="csdn">csdn主页</a>
  <a href="https://xie.infoq.cn/u/autofelix/publish" rel="external nofollow" id="infoq">infoq主页</a>
  <a href="https://blog.51cto.com/autofelix" rel="external nofollow" id="51cto">51cto主页</a>
  <p class="attention">跪求关注 一键三连</p>
  <p class="introduce">
    <a href="https://www.cnblogs.com/autofelix" rel="external nofollow" id="cnblogs">博客园主页</a>
  </p>
</body>
</html>
"""
import re
from bs4 import BeautifulSoup

# Create the soup object
soup = BeautifulSoup(html, 'html.parser')

# Find the first <a> tag and return it directly
print(soup.find('a'))
# Find the <title> tag
print(soup.find('title'))
# Match an <a> tag with a specific href attribute
print(soup.find('a', href='https://autofelix.blog.csdn.net'))
# Match an attribute value with a regular expression
print(soup.find(class_=re.compile('tro')))
# The attrs parameter
print(soup.find(attrs={'class': 'introduce'}))

# find() returns None when nothing matches; find_all() returns an empty list
print(soup.find('aa'))
print(soup.find_all('bb'))

# Shorthand
print(soup.head.title)
# The line above is equivalent to
print(soup.find("head").find("title"))
```
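Because find() returns None on a miss, chaining attribute access on its result raises AttributeError. A defensive pattern worth keeping in mind (a minimal sketch):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<p class="intro">hello</p>', 'html.parser')

# Chaining on a missing tag would raise:
#   soup.find('aa').text  ->  AttributeError on NoneType
tag = soup.find('aa')
text = tag.text if tag is not None else 'not found'
print(text)   # not found
```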
That concludes this article on parsing web pages with BeautifulSoup in Python.
Original article: https://blog.51cto.com/autofelix/5248473