1. Installation
- Beautiful Soup is a third-party library, so it must be installed separately; installation is straightforward.
- Because BS4 relies on a document parser when parsing a page, you should also install lxml as the parsing library.
- Python ships with its own parser, html.parser, but it is somewhat slower than lxml.
```shell
pip install bs4
pip install lxml
pip install html5lib
```
2. Parsing with html.parser
- html.parser names the parser used when parsing the document.
- The parser can also be lxml or html5lib.
```python
html = '''
<div class="modal-dialog">
  <div class="modal-content">
    <div class="modal-header">
      <button type="button" class="close" data-dismiss="modal">×</button>
      <h4 class="modal-title">Modal title</h4>
    </div>
    <div class="modal-body">
      ...
    </div>
    <div class="modal-footer">
      <a href="#" rel="external nofollow" class="btn btn-default" data-dismiss="modal">Close</a>
      <a href="#" rel="external nofollow" class="btn btn-primary">Save</a>
    </div>
  </div>
</div>
'''
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'html.parser')
# prettify() pretty-prints the parsed HTML/XML document
print(soup.prettify())
```
3. Parsing an external document
- An external document can also be opened and read with open().
```python
from bs4 import BeautifulSoup

fp = open('html_doc.html', encoding='utf8')
soup = BeautifulSoup(fp, 'lxml')
```
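In practice you may prefer a with block so the file handle is closed automatically. A minimal sketch (the file name html_doc.html and its contents are placeholders created by the example itself; html.parser is used here so no extra parser needs to be installed):

```python
from bs4 import BeautifulSoup

# Create a small sample file so the example is self-contained
with open('html_doc.html', 'w', encoding='utf8') as fp:
    fp.write('<html><head><title>demo</title></head>'
             '<body><p>hello</p></body></html>')

# The with block closes the file handle automatically on exit
with open('html_doc.html', encoding='utf8') as fp:
    soup = BeautifulSoup(fp, 'html.parser')

print(soup.title.text)   # demo
```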
4. Tag selectors
- Tags are the basic building blocks of an HTML document.
- Using tag names and tag attributes, you can extract the content you want.
```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<p class="name nickname user"><b>i am autofelix</b></p>', 'html.parser')

# Get the HTML of the whole <p> tag
print(soup.p)
# Get the <b> tag
print(soup.p.b)
# Get the text of the <p> tag, via the NavigableString accessors string, text, or get_text()
print(soup.p.text)
# Returns a dict of all attributes and their values
print(soup.p.attrs)
# Check the type of the returned object
print(type(soup.p))
# Get an attribute's value by name; for class, the value is a list
print(soup.p['class'])
# Assign a new value to the class attribute; the list is rendered back into the tag
soup.p['class'] = ['Web', 'Site']
print(soup.p)
```
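The three text accessors are not interchangeable: string returns None when a tag has more than one child, while text and get_text() concatenate all descendant strings. A small sketch of the difference:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<p>one <b>two</b></p>', 'html.parser')

# .string is None: <p> has two children, a text node and a <b> tag
print(soup.p.string)    # None
# .text (and .get_text()) joins all descendant strings
print(soup.p.text)      # one two
# With a single child, .string works as expected
print(soup.b.string)    # two
```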
5. CSS selectors
- Most CSS selectors are supported, including the common tag, class, and id selectors, as well as hierarchy selectors.
- Passing a selector to the select() method searches the HTML document for the matching content.
```python
html = """
<html>
<head>
  <title>零基础学编程</title>
</head>
<body>
  <p class="intro"><b>i am autofelix</b></p>
  <p class="nickname">飞兔小哥</p>
  <a href="https://autofelix.blog.csdn.net" rel="external nofollow" id="csdn">csdn主页</a>
  <a href="https://xie.infoq.cn/u/autofelix/publish" rel="external nofollow" id="infoq">infoq主页</a>
  <a href="https://blog.51cto.com/autofelix" rel="external nofollow" id="51cto">51cto主页</a>
  <p class="attention">跪求关注 一键三连</p>
  <p class="introduce">
    <a href="https://www.cnblogs.com/autofelix" rel="external nofollow" id="cnblogs">博客园主页</a>
  </p>
</body>
</html>
"""
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'html.parser')

# Find by tag name
print(soup.select('p'))
# Find by attribute selector
print(soup.select('a[href]'))
# Find by class
print(soup.select('.attention'))
# Find descendant nodes
print(soup.select('html head title'))
# Find sibling nodes
print(soup.select('p + a'))
# Select a sibling of <p> by id
print(soup.select('p ~ #csdn'))
# nth-of-type(n) matches the nth sibling of the same type
print(soup.select('p ~ a:nth-of-type(1)'))
# Find child nodes
print(soup.select('p > a'))
print(soup.select('.introduce > #cnblogs'))
```
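When you only need the first match, select_one() returns a single tag (or None) rather than a list. A minimal sketch:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<a id="csdn">csdn</a><a id="infoq">infoq</a>',
                     'html.parser')

# select() always returns a list, even when an id is unique
print(soup.select('#csdn'))         # [<a id="csdn">csdn</a>]
# select_one() returns the first matching tag...
print(soup.select_one('a')['id'])   # csdn
# ...or None when nothing matches
print(soup.select_one('.missing'))  # None
```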
6. Traversing nodes
- contents and children traverse child nodes.
- parent and parents traverse parent nodes.
- next_sibling and previous_sibling traverse sibling nodes.
```python
html = """
<html>
<head>
  <title>零基础学编程</title>
</head>
<body>
  <p class="intro"><b>i am autofelix</b></p>
  <p class="nickname">飞兔小哥</p>
  <a href="https://autofelix.blog.csdn.net" rel="external nofollow" id="csdn">csdn主页</a>
  <a href="https://xie.infoq.cn/u/autofelix/publish" rel="external nofollow" id="infoq">infoq主页</a>
  <a href="https://blog.51cto.com/autofelix" rel="external nofollow" id="51cto">51cto主页</a>
  <p class="attention">跪求关注 一键三连</p>
  <p class="introduce">
    <a href="https://www.cnblogs.com/autofelix" rel="external nofollow" id="cnblogs">博客园主页</a>
  </p>
</body>
</html>
"""
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'html.parser')

body_tag = soup.body
print(body_tag)

# contents returns all child nodes as a list
print(body_tag.contents)

# children iterates over the child nodes
for child in body_tag.children:
    print(child)
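The bullets above also mention parent and sibling traversal, which the listing does not demonstrate. A minimal sketch (note that whitespace between tags counts as a text-node sibling, so this example uses markup without gaps):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<div><p id="a">one</p><p id="b">two</p></div>',
                     'html.parser')
p = soup.find('p', id='a')

# parent walks one level up; parents yields every ancestor
print(p.parent.name)                 # div
print([t.name for t in p.parents])   # ['div', '[document]']

# next_sibling / previous_sibling move between adjacent nodes
print(p.next_sibling['id'])                           # b
print(soup.find('p', id='b').previous_sibling['id'])  # a
```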
7. The find_all method
- find_all() is one of the most commonly used methods for parsing HTML documents.
- It searches all descendants of the current tag, checks whether each node matches the filter conditions, and returns the matching content as a list.
```python
html = """
<html>
<head>
  <title>零基础学编程</title>
</head>
<body>
  <p class="intro"><b>i am autofelix</b></p>
  <p class="nickname">飞兔小哥</p>
  <a href="https://autofelix.blog.csdn.net" rel="external nofollow" id="csdn">csdn主页</a>
  <a href="https://xie.infoq.cn/u/autofelix/publish" rel="external nofollow" id="infoq">infoq主页</a>
  <a href="https://blog.51cto.com/autofelix" rel="external nofollow" id="51cto">51cto主页</a>
  <p class="attention">跪求关注 一键三连</p>
  <p class="introduce">
    <a href="https://www.cnblogs.com/autofelix" rel="external nofollow" id="cnblogs">博客园主页</a>
  </p>
</body>
</html>
"""
import re
from bs4 import BeautifulSoup

# Create the soup object
soup = BeautifulSoup(html, 'html.parser')

# Find and return all <a> tags
print(soup.find_all("a"))
# Return only the first two <a> tags
print(soup.find_all("a", limit=2))
# Find by tag attribute and attribute value
print(soup.find_all("p", class_="nickname"))
print(soup.find_all(id="infoq"))
# Find tags given a list of tag names
print(soup.find_all(['b', 'a']))
# Match id attribute values with a regular expression
print(soup.find_all('a', id=re.compile(r'.\d')))
print(soup.find_all(id=True))

# True matches anything: find every tag and print its name
for tag in soup.find_all(True):
    print(tag.name, end=" ")

# Print all tags whose name starts with 'b'
for tag in soup.find_all(re.compile("^b")):
    print(tag.name)

# Full form
soup.find_all("a")
# Shorthand: calling the soup object directly is equivalent
soup("a")
```
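Besides strings, lists, regular expressions, and True, find_all() also accepts a function that receives each tag and returns a boolean, which is useful for conditions the other filters cannot express. A minimal sketch:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<p class="intro">a</p><p>b</p><a href="#">c</a>',
                     'html.parser')

# A filter function: matches tags that have a class attribute
def has_class(tag):
    return tag.has_attr('class')

print(soup.find_all(has_class))   # [<p class="intro">a</p>]

# Lambdas work too: <p> tags WITHOUT a class attribute
print(soup.find_all(lambda t: t.name == 'p' and not t.has_attr('class')))
```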
8. The find method
```python
html = """
<html>
<head>
  <title>零基础学编程</title>
</head>
<body>
  <p class="intro"><b>i am autofelix</b></p>
  <p class="nickname">飞兔小哥</p>
  <a href="https://autofelix.blog.csdn.net" rel="external nofollow" id="csdn">csdn主页</a>
  <a href="https://xie.infoq.cn/u/autofelix/publish" rel="external nofollow" id="infoq">infoq主页</a>
  <a href="https://blog.51cto.com/autofelix" rel="external nofollow" id="51cto">51cto主页</a>
  <p class="attention">跪求关注 一键三连</p>
  <p class="introduce">
    <a href="https://www.cnblogs.com/autofelix" rel="external nofollow" id="cnblogs">博客园主页</a>
  </p>
</body>
</html>
"""
import re
from bs4 import BeautifulSoup

# Create the soup object
soup = BeautifulSoup(html, 'html.parser')

# Find the first <a> tag and return it directly
print(soup.find('a'))
# Find the <title> tag
print(soup.find('title'))
# Match an <a> tag with a specific href attribute
print(soup.find('a', href='https://autofelix.blog.csdn.net'))
# Match an attribute value with a regular expression
print(soup.find(class_=re.compile('tro')))
# The attrs parameter
print(soup.find(attrs={'class': 'introduce'}))

# find() returns None when nothing matches; find_all() returns an empty list
print(soup.find('aa'))
print(soup.find_all('bb'))

# Shorthand
print(soup.head.title)
# The line above is equivalent to
print(soup.find("head").find("title"))
```
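Because find() returns None on a miss, chaining attribute access on its result raises AttributeError. A defensive pattern worth keeping in mind (a minimal sketch):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<p class="intro">hello</p>', 'html.parser')

# Chaining on a missing tag would raise:
#   soup.find('aa').text  ->  AttributeError on NoneType
tag = soup.find('aa')
text = tag.text if tag is not None else 'not found'
print(text)   # not found
```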
That concludes this article on parsing web pages with BeautifulSoup in Python.
Original article: https://blog.51cto.com/autofelix/5248473