I Used Python to Archive 100 GB of Images Overnight, Just in Case the Site Disappears

Reader submission 604 2022-05-29

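Scraping 100 GB of Coser Images with Python

Goals of this post

Scraping target

Target data source: http://www.cosplay8.com/pic/chinacos/. This is yet another cosplay site, and sites of this kind tend to vanish from the internet, so let's scrape it to keep a copy of the data.

Python modules used

requests, re, os

Key learning points

Today's focus is paginated scraping of detail pages. Earlier posts in this series did not cover the technique, so pay particular attention to it while writing the code.

Analyzing the list and detail pages

The browser's developer tools make it easy to locate the tags that hold the target data.

Click any image to open its detail page. Each detail page shows a single image, so a gallery is spread across several pages.

The list-page and detail-page URLs follow these rules:

List pages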

http://www.cosplay8.com/pic/chinacos/list_22_1.html

http://www.cosplay8.com/pic/chinacos/list_22_2.html

http://www.cosplay8.com/pic/chinacos/list_22_3.html

Detail pages

http://www.cosplay8.com/pic/chinacos/2021/0601/61823.html

http://www.cosplay8.com/pic/chinacos/2021/0601/61823_2.html

http://www.cosplay8.com/pic/chinacos/2021/0601/61823_3.html

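Note that the first detail page carries no index 1, so while reading the total page count from that page, the scraper must also save its image.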
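The URL rule above can be captured in a small helper. This is a minimal sketch rather than the original author's code: `detail_page_urls` is a hypothetical name, and it assumes the total page count has already been read from the first page.

```python
def detail_page_urls(first_url, total_pages):
    """Build every page URL of one gallery from its first-page URL.

    Hypothetical helper: page 1 keeps the bare URL, and pages 2..N
    insert "_<n>" before the ".html" suffix.
    """
    if not first_url.endswith(".html"):
        raise ValueError(f"unexpected detail URL: {first_url}")
    stem = first_url[: -len(".html")]
    urls = [first_url]  # the first page has no "_1" suffix
    for page in range(2, total_pages + 1):
        urls.append(f"{stem}_{page}.html")
    return urls


if __name__ == "__main__":
    first = "http://www.cosplay8.com/pic/chinacos/2021/0601/61823.html"
    for u in detail_page_urls(first, 3):
        print(u)
```

Keeping the first URL in the returned list makes it harder to forget the image on page 1, which is the pitfall the note above warns about.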

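Coding time

The target site sorts images into categories (domestic cos, overseas cos, the Hanfu circle, and Lolita), so the category can be taken as user input at run time, making the scraping target configurable.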

def run(category, start, end):
    # Build the list pages to scrape
    wait_url = [
        f"http://www.cosplay8.com/pic/chinacos/list_{category}_{i}.html"
        for i in range(int(start), int(end) + 1)
    ]
    print(wait_url)

    url_list = []
    for item in wait_url:
        # get_list is provided later in this post
        ret = get_list(item)
        print(f"Fetched {len(ret)} links")
        url_list.extend(ret)


if __name__ == "__main__":
    # http://www.cosplay8.com/pic/chinacos/list_22_2.html
    category = input("Enter the category number: ")
    start = input("Enter the start page: ")
    end = input("Enter the end page: ")
    run(category, start, end)

The code above first builds the target URLs from the user's input, then passes them one at a time to the get_list function, shown below:

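import re

import requests

# Assumption: headers is never defined in the post; a basic User-Agent is used here
headers = {"User-Agent": "Mozilla/5.0"}


def get_list(url):
    """Get all detail-page links."""
    res = requests.get(url, headers=headers)
    html = res.text
    pattern = re.compile('...')  # pattern elided in the source
    all_list = pattern.findall(html)
    return all_list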

The regular expression matches every detail-page address on the list page, and the function returns all of the matches as one list.
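To see how a pattern of this kind behaves, the sketch below runs `findall` against a hand-written HTML fragment. Both the fragment and the pattern are assumptions made for illustration; the markup of the real list page may differ, and `extract_detail_links` is a hypothetical name.

```python
import re

# Hand-written sample; NOT copied from the real site.
SAMPLE_HTML = """
<ul>
  <li><a href="/pic/chinacos/2021/0601/61823.html">Gallery A</a></li>
  <li><a href="/pic/chinacos/2021/0531/61810.html">Gallery B</a></li>
  <li><a href="/pic/chinacos/list_22_2.html">Next page</a></li>
</ul>
"""

# Assumed pattern: relative links shaped like /pic/<section>/<yyyy>/<mmdd>/<id>.html
DETAIL_LINK = re.compile(r'href="(/pic/\w+/\d{4}/\d{4}/\d+\.html)"')


def extract_detail_links(html):
    """Return detail-page hrefs in page order, without duplicates."""
    seen = set()
    links = []
    for href in DETAIL_LINK.findall(html):
        if href not in seen:
            seen.add(href)
            links.append(href)
    return links


if __name__ == "__main__":
    print(extract_detail_links(SAMPLE_HTML))
```

Because the pattern has a single capture group, `findall` returns the captured hrefs directly, and the pagination link is skipped because it lacks the date segments.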

Next, extend the run function so that it fetches the images from each detail page and saves them.

def run(category, start, end):
    # List pages to scrape
    wait_url = [
        f"http://www.cosplay8.com/pic/chinacos/list_{category}_{i}.html"
        for i in range(int(start), int(end) + 1)
    ]
    print(wait_url)

    url_list = []
    for item in wait_url:
        ret = get_list(item)
        print(f"Fetched {len(ret)} links")
        url_list.extend(ret)

    print(url_list)
    # print(len(url_list))
    for url in url_list:
        get_detail(f"http://www.cosplay8.com{url}")

The matched detail-page addresses are relative, so each one is formatted into a full URL by prefixing the site root.
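As an alternative to string formatting, the standard library's `urllib.parse.urljoin` can resolve relative links. This is a small sketch, not the post's own approach; `to_absolute` and `BASE_URL` are hypothetical names.

```python
from urllib.parse import urljoin

# Site root taken from the URLs shown in this post
BASE_URL = "http://www.cosplay8.com"


def to_absolute(href, base=BASE_URL):
    """Resolve a relative href against the site root.

    An href that is already absolute is returned unchanged.
    """
    return urljoin(base, href)


if __name__ == "__main__":
    print(to_absolute("/pic/chinacos/2021/0601/61823.html"))
```

The practical difference from an f-string prefix is that an already absolute link passes through unchanged instead of being prefixed a second time.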

The get_detail function is as follows:

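def get_detail(url):
    # Request the detail page
    res = requests.get(url=url, headers=headers)
    # Set the encoding
    res.encoding = "utf-8"
    # Page source
    html = res.text

    # Parse the page count; the first image is saved at this point too
    size_pattern = re.compile(r'共(\d+)页: ')
    # Get the title; titles turned out to differ between pages, so the regex was revised
    # title_pattern = re.compile('(.*?)-Cosplay中国')
    title_pattern = re.compile('(.*?)-Cosplay(中国|8)')
    # Image pattern
    first_img_pattern = re.compile("...")  # pattern elided in the source
    # ... the source is truncated here and resumes at:
    # title_pattern.search(html).group(1)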

The core logic of the code above is described in its comments. The part to focus on is the title regex, which was first written as:

(.*?)-Cosplay中国

It later turned out that this did not match every page, so it was changed to:

(.*?)-Cosplay(中国|8)
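The effect of the alternation is easy to check on plain strings. The sketch below is illustrative only: the sample titles are invented, `parse_title` and `parse_page_count` are hypothetical helpers, and the page-count pattern drops the trailing ": " used in the post so that it tolerates small spacing differences.

```python
import re

OLD_TITLE_PATTERN = re.compile(r"(.*?)-Cosplay中国")
TITLE_PATTERN = re.compile(r"(.*?)-Cosplay(中国|8)")
SIZE_PATTERN = re.compile(r"共(\d+)页")


def parse_title(text):
    """Return the text before the site suffix, or None if nothing matches."""
    m = TITLE_PATTERN.search(text)
    return m.group(1).strip() if m else None


def parse_page_count(text, default=1):
    """Return the total page count, or `default` when no pager text is found."""
    m = SIZE_PATTERN.search(text)
    return int(m.group(1)) if m else default


if __name__ == "__main__":
    # Invented titles covering both suffixes
    for sample in ("Sample Gallery-Cosplay中国", "Another Gallery-Cosplay8"):
        print(sample, "|", bool(OLD_TITLE_PATTERN.search(sample)), "|", parse_title(sample))
    print(parse_page_count("共12页: "))
```

Returning `None` or a default instead of calling `.group(1)` on a failed match avoids an `AttributeError` when a page deviates from both title formats.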

The remaining piece, the save_img function, is as follows:

def save_img(path, title, first_img, index):
    try:
        # Request the image
        img_res = requests.get(f"http://www.cosplay8.com{first_img}", headers=headers)
        img_data = img_res.content
        with open(f"{path}/{title}_{index}.png", "wb+") as f:
            f.write(img_data)
    except Exception as e:
        print(e)
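save_img writes every file with a .png suffix and expects the target folder to exist already. One possible refinement, sketched here under assumptions, is to build the file path separately. `build_image_path` is a hypothetical helper, and the `.jpg` fallback is an arbitrary choice rather than something the post specifies.

```python
import os
import re
from urllib.parse import urlparse


def build_image_path(folder, title, img_src, index):
    """Return a file path for one image, keeping the source extension."""
    # Replace characters that are not allowed in file names on common systems
    safe_title = re.sub(r'[\\/:*?"<>|]', "_", title).strip()
    # Take the extension from the URL path; fall back to .jpg (assumption)
    ext = os.path.splitext(urlparse(img_src).path)[1] or ".jpg"
    # Create the folder if needed so the later open() call does not fail
    os.makedirs(folder, exist_ok=True)
    return os.path.join(folder, f"{safe_title}_{index}{ext}")


if __name__ == "__main__":
    print(build_image_path("images", "Sample: Gallery", "/uploads/a/b/1.jpg", 1))
```

The returned path could then be passed to `open(..., "wb")` in place of the hard-coded f-string.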

Full code: https://codechina.csdn.net/hihell/python120, No6.
