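Browsing the Liyang Photography Circle with Python

Preface:

This post continues the study of BeautifulSoup. The target site is the Liyang Photography Circle (溧阳摄影圈) board of a local forum.

Target site analysis

The list pages of the target site are paginated according to the following rule: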

									http://www.jsly001.com/thread-htm-fid-45-page-{page number}.html
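
As a minimal sketch of this rule, the list-page URLs can be generated from a template. The helper name build_page_urls and the page range used below are assumptions for illustration and do not appear in the original code.

```python
# Hypothetical helper: expands the paging rule into concrete list-page URLs.
PAGE_URL = "http://www.jsly001.com/thread-htm-fid-45-page-{page}.html"


def build_page_urls(first, last):
    """Return the list-page URLs for pages first..last (inclusive)."""
    return [PAGE_URL.format(page=page) for page in range(first, last + 1)]


if __name__ == "__main__":
    for url in build_page_urls(1, 3):
        print(url)
```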

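The code is written with the threading module for multithreading, together with the requests and BeautifulSoup modules.

The crawl follows the rule list page → detail page.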

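Liyang Photography Circle image-scraping code

This is a hands-on example: the complete code is shown first, then explained through its comments and key functions.

The main implementation steps are:

Set the log output level
Declare a LiYangThread class that inherits from threading.Thread
Initialise the thread object
Have each thread fetch the shared (global) resources
Call the HTML parsing function
Locate the board's topic divider row, mainly to avoid collecting pinned topics
Parse with lxml
Extract the titles and detail-page URLs
Parse the image addresses
Save the images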

									import logging
									import os
									import random
									import threading

									from bs4 import BeautifulSoup
									import requests
									import lxml  # not used directly; fails early if the lxml parser is missing

									# Set the log output level (NOTSET lets every message through, including DEBUG)
									logging.basicConfig(level=logging.NOTSET)

									BASE_URL = "http://www.jsly001.com/"
									IMG_DIR = "./imgs"


									# Declare a LiYangThread class that inherits from threading.Thread
									class LiYangThread(threading.Thread):
									    def __init__(self):
									        super().__init__()  # initialise the underlying Thread object
									        self._headers = self._get_headers()  # pick a random user agent
									        self._timeout = 5  # request timeout in seconds

									    def _get(self, url):
									        """Fetch a URL; log the error and return None on failure."""
									        try:
									            return requests.get(url=url, headers=self._headers, timeout=self._timeout)
									        except requests.RequestException as e:
									            logging.error(e)
									            return None

									    # Each thread fetches the shared (global) resources
									    def run(self):
									        # while True:  # place to loop over pages when running several threads
									        res = self._get(BASE_URL + "thread-htm-fid-45-page-1.html")  # test with the first page
									        if res is not None:
									            self._format_html(res.text)  # call the HTML parsing function

									    def _format_html(self, html):
									        # Parse with lxml
									        soup = BeautifulSoup(html, 'lxml')
									        # Locate the topic divider row so that pinned topics above it are skipped
									        part_tr = soup.find(attrs={'class': 'bbs_tr4'})
									        if part_tr is not None:
									            items = part_tr.find_all_next(attrs={"name": "readlink"})  # detail-page links
									        else:
									            items = soup.find_all(attrs={"name": "readlink"})
									        # Extract the titles and detail-page URLs
									        data = [(item.text, f'{BASE_URL}{item["href"]}') for item in items]
									        # Visit each detail page
									        for name, url in data:
									            self._get_imgs(name, url)

									    def _get_imgs(self, name, url):
									        """Parse the image addresses"""
									        res = self._get(url)
									        # Image extraction logic
									        if res is not None:
									            soup = BeautifulSoup(res.text, 'lxml')
									            origin_div1 = soup.find(attrs={'class': 'tpc_content'})
									            origin_div2 = soup.find(attrs={'class': 'imgList'})
									            content = origin_div2 if origin_div2 else origin_div1
									            if content is not None:
									                imgs = content.find_all('img')
									                # print([img.get("src") for img in imgs])
									                self._save_img(name, imgs)  # save the images

									    def _save_img(self, name, imgs):
									        """Save the images"""
									        os.makedirs(IMG_DIR, exist_ok=True)  # create the imgs folder if it does not exist
									        name = name.replace("/", "_")
									        for index, img in enumerate(imgs):
									            url = img.get("src")
									            if not url or 'http' not in url:
									                continue
									            # Look up the id attribute on the enclosing <span>; fall back to the loop index
									            span = img.find_parent('span')
									            id_ = (span.get("id") if span is not None else None) or index
									            res = self._get(url)
									            if res is not None:
									                with open(f'{IMG_DIR}/{name}_{id_}.jpg', "wb") as f:
									                    f.write(res.content)

									    def _get_headers(self):
									        uas = [
									            "Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)",
									        ]
									        ua = random.choice(uas)
									        headers = {
									            "user-agent": ua
									        }
									        return headers


									if __name__ == '__main__':
									    my_thread = LiYangThread()
									    my_thread.start()  # start() executes run() in a new thread; calling run() directly does not
									    my_thread.join()
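
The commented-out while True in run() marks where the multi-threaded loop would go. One possible way to fill it in, shown only as a sketch, is to let several threads pull page numbers from a shared queue.Queue. The names crawl_pages and fetch_page are hypothetical, and fetch_page is a stand-in that a real run would replace with the request-and-parse logic of the listing.

```python
import queue
import threading


def fetch_page(page):
    """Hypothetical stand-in for the real work (request + parse one list page)."""
    return f"http://www.jsly001.com/thread-htm-fid-45-page-{page}.html"


def crawl_pages(pages, worker_count=3, fetch=fetch_page):
    """Process every page number in `pages` using `worker_count` threads."""
    tasks = queue.Queue()
    for page in pages:
        tasks.put(page)

    results = []
    lock = threading.Lock()  # protects `results`, which all threads share

    def worker():
        while True:
            try:
                page = tasks.get_nowait()  # each thread takes the next shared task
            except queue.Empty:
                return  # queue drained: this thread is done
            result = fetch(page)
            with lock:
                results.append(result)

    threads = [threading.Thread(target=worker) for _ in range(worker_count)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


if __name__ == "__main__":
    print(sorted(crawl_pages(range(1, 6))))
```

Because the threads finish in no fixed order, the results come back unordered; sort them or carry the page number along if order matters.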

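In this example, BeautifulSoup parses the HTML with the lxml parser, which later posts in this series will mostly use as well. Make sure lxml is installed before running the code (for example with pip install lxml). BeautifulSoup selects the parser by the name 'lxml', so the import lxml line in the listing is not strictly required.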
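
If lxml may be missing on some machines, one option is to fall back to Python's built-in parser. This is a sketch rather than part of the original code; make_soup is a hypothetical helper name, and it relies on the FeatureNotFound exception that BeautifulSoup raises when a requested parser is unavailable.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(html):
    """Parse with lxml when available, otherwise with Python's built-in parser."""
    try:
        return BeautifulSoup(html, "lxml")
    except FeatureNotFound:
        # Raised by BeautifulSoup when the requested parser is not installed
        return BeautifulSoup(html, "html.parser")


if __name__ == "__main__":
    soup = make_soup("<p class='demo'>hello</p>")
    print(soup.find("p").text)
```

The two parsers can build slightly different trees for badly broken HTML, so results on real pages may differ between them.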

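Data extraction relies on the soup.find() and soup.find_all() functions. The code also uses find_all_next() to skip pinned topics, and find_parent() to read the id attribute from a parent tag.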
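
The pinned-topic handling can be tried offline on a small piece of markup. The HTML below is a hypothetical, heavily simplified stand-in for the real list page; only the class name bbs_tr4 and the name="readlink" attribute come from the code in this post.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified list-page markup; the real page is more complex.
HTML = """
<table>
  <tr><td><a name="readlink" href="read-1.html">Pinned topic</a></td></tr>
  <tr class="bbs_tr4"><td>Forum topics</td></tr>
  <tr><td><a name="readlink" href="read-2.html">Sunset</a></td></tr>
  <tr><td><a name="readlink" href="read-3.html">Old street</a></td></tr>
</table>
"""

soup = BeautifulSoup(HTML, "html.parser")  # built-in parser, so lxml is not needed here

all_links = soup.find_all(attrs={"name": "readlink"})  # every topic link on the page
divider = soup.find(attrs={"class": "bbs_tr4"})  # first matching tag, or None
after_divider = divider.find_all_next(attrs={"name": "readlink"})  # links after the divider

titles = [a.text for a in after_divider]
print(len(all_links), titles)
```

find_all() sees the pinned link as well, while find_all_next() called on the divider row returns only the links that come after it in the document.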

									# Look up the id attribute on the enclosing <span>
									id_ = img.find_parent('span').get("id")
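
find_parent() returns None when no matching ancestor exists, so calling .get() directly on its result can raise AttributeError for images that are not wrapped in a span. A guarded variant might look like the sketch below; the markup, the helper name parent_span_id and the "no-id" default are all assumptions made for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: one image wrapped in a <span id=...>, one without.
HTML = """
<div class="tpc_content">
  <span id="att_101"><img src="http://example.com/a.jpg"></span>
  <img src="http://example.com/b.jpg">
</div>
"""


def parent_span_id(img, default=None):
    """Return the id of the nearest enclosing <span>, or `default` if none exists."""
    span = img.find_parent("span")
    if span is None:
        return default
    return span.get("id", default)


soup = BeautifulSoup(HTML, "html.parser")
ids = [parent_span_id(img, default="no-id") for img in soup.find_all("img")]
print(ids)
```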

If DEBUG messages appear while the code is running, adjust the logging output level to suppress them.
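
A minimal sketch of that adjustment follows. The logger name "liyang" is a hypothetical choice; "urllib3" is the library that requests uses for its connections, and it writes a DEBUG line for each connection and request.

```python
import logging

# Show INFO and above; DEBUG messages are dropped.
# force=True (Python 3.8+) replaces any handlers that were configured earlier.
logging.basicConfig(level=logging.INFO, force=True)

# Raise the level of urllib3's logger so its per-connection DEBUG lines stay
# hidden even if the root level is lowered again later.
logging.getLogger("urllib3").setLevel(logging.WARNING)

log = logging.getLogger("liyang")  # hypothetical logger name for this script
log.debug("hidden at INFO level")
log.info("shown at INFO level")
```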

Original link: https://juejin.cn/post/7079963621077286925
