

news 2025/11/5 8:09:08

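Table of contents

Preface
I. Versions used
II. Requirements analysis
    1. Analyzing what to scrape
        1.1 Information to scrape for a single book
        1.2 Scraping steps
            1.2.1 Scrape the Douban Books tag index page
            1.2.2 Scrape the tag listing pages
            1.2.3 Scrape the individual book pages
        1.3 Locating the tags that hold the content
    2. Uses of the data
        2.1 Basic analysis
        2.2 Advanced analysis
    3. Strategies for dealing with anti-scraping measures
        3.1 Use a User-Agent to imitate real browser requests
        3.2 Apply random delays
        3.3 Build and use a proxy pool
        3.4 Other
III. Writing the crawler code
    1. Scraping the tag index HTML
    2. Scraping all pages of a single tag
    3. Scraping the HTML of individual books
IV. Data processing and storage
    1. Parsing the HTML and saving the data to a CSV file
        1.1 Field descriptions
        1.2 Implementation
    2. Data cleaning and storage
        2.1 Data cleaning
        2.2 Data storage
            2.2.1 Table design
            2.2.2 Table creation
        2.3 Implementation

Preface

In the digital age, web scraping gives us a powerful way to acquire data and makes extracting information from all kinds of websites faster and more convenient. Douban Books (豆瓣读书) is a widely used book rating and recommendation platform that gathers a large amount of book information, including titles, authors, publishers and ratings. This information helps readers choose books and is also a valuable data source for publishers and researchers.

This project uses Python to systematically scrape book information from the Douban Books website and store it in a structured format for later analysis and research. requests and BeautifulSoup handle the page requests and HTML parsing, pandas handles the data processing, and the cleaned data is finally stored in a MySQL database.

I. Versions used

    Package                   Version
    python                    3.8.5
    requests                  2.31.0
    bs4                       0.0.2
    beautifulsoup4            4.12.3
    soupsieve                 2.6
    lxml                      4.9.3
    pandas                    2.0.3
    sqlalchemy                2.0.36
    mysql-connector-python    9.0.0
    selenium                  4.15.2

II. Requirements analysis

1. Analyzing what to scrape

1.1 Information to scrape for a single book

Open the Douban Books home page, https://book.douban.com/, and click on any book. The book's page shows the title, author, publisher, publication date, page count, price, rating and similar details. The goal is to scrape this information and save it to a CSV file as raw data.

Right-click the page and choose Inspect to find the corresponding source. Hovering over the enclosing tag in the developer tools highlights the region of the page that holds the information we want, which shows that all of the single-book information lives inside that tag. The content of that tag, copied below, contains every field we need. The plan is therefore to download each book's HTML file, parse it with BeautifulSoup, and save the data to a CSV file.

    <div class="subjectwrap clearfix">
      <div class="subject clearfix">
        <div id="mainpic" class="">
          <a class="nbg" href="https://img1.doubanio.com/view/subject/l/public/s34971089.jpg" title="再造乡土">
            <img src="https://img1.doubanio.com/view/subject/s/public/s34971089.jpg" title="点击看大图" alt="再造乡土" rel="v:photo" style="max-width: 135px;max-height: 200px;">
          </a>
        </div>
        <div id="info" class="">
          <span><span class="pl"> 作者</span>:<a class="" href="/author/4639586">美萨拉·法默</a></span><br>
          <span class="pl">出版社:</span><a href="https://book.douban.com/press/2476">广西师范大学出版社</a><br>
          <span class="pl">出品方:</span><a href="https://book.douban.com/producers/795">望mountain</a><br>
          <span class="pl">副标题:</span> 1945年后法国农村社会的衰落与重生<br>
          <span class="pl">原作名:</span> Rural Inventions: The French Countryside after 1945<br>
          <span><span class="pl"> 译者</span>:<a class="" href="/search/%E5%8F%B6%E8%97%8F">叶藏</a></span><br>
          <span class="pl">出版年:</span> 2024-11<br>
          <span class="pl">页数:</span> 288<br>
          <span class="pl">定价:</span> 79.20元<br>
          <span class="pl">装帧:</span> 精装<br>
          <span class="pl">ISBN:</span> 9787559874597<br>
        </div>
      </div>
      <div id="interest_sectl" class="">
        <div class="rating_wrap clearbox" rel="v:rating">
          <div class="rating_logo">豆瓣评分</div>
          <div class="rating_self clearfix" typeof="v:Rating">
            <strong class="ll rating_num" property="v:average"> 8.5 </strong>
            <span property="v:best" content="10.0"></span>
            <div class="rating_right ">
              <div class="ll bigstar45"></div>
              <div class="rating_sum"><span class=""><a href="comments" class="rating_people"><span property="v:votes">55</span>人评价</a></span></div>
            </div>
          </div>
          <span class="stars5 starstop" title="力荐">5星</span>
          <div class="power" style="width:37px"></div><span class="rating_per">29.1%</span><br>
          <span class="stars4 starstop" title="推荐">4星</span>
          <div class="power" style="width:64px"></div><span class="rating_per">49.1%</span><br>
          <span class="stars3 starstop" title="还行">3星</span>
          <div class="power" style="width:26px"></div><span class="rating_per">20.0%</span><br>
          <span class="stars2 starstop" title="较差">2星</span>
          <div class="power" style="width:2px"></div><span class="rating_per">1.8%</span><br>
          <span class="stars1 starstop" title="很差">1星</span>
          <div class="power" style="width:0px"></div><span class="rating_per">0.0%</span><br>
        </div>
      </div>
    </div>

1.2 Scraping steps

1.2.1 Scrape the Douban Books tag index page

The tag index is at https://book.douban.com/tag/?view=type&icn=index-sorttags-all

Download the tag index page and save it as ../douban/douban_book/douban_book_tag/douban_book_all_tag.html. Then parse that file with BeautifulSoup to get the name and link of every tag.

1.2.2 Scrape the tag listing pages

Clicking the 小说 (fiction) tag, for example, opens https://book.douban.com/tag/小说, so the first page of any tag is https://book.douban.com/tag/<tag name>.

Each listing page shows only summary information for a number of books, which is not enough for our purposes. We therefore need the link of every page in the listing and save the HTML of each page. Inspecting the pager shows that each page holds 20 items and that the URL carries two parameters, start and type, where start is the offset of the first item on the page. The link of each page is therefore https://book.douban.com/tag/<tag name>?start=<multiple of 20>&type=T. The link of each book is then parsed out of every listing page.

1.2.3 Scrape the individual book pages

Once we have each book's link, we save the HTML of each book page and then use BeautifulSoup to parse the book information out of it.

1.3 Locating the tags that hold the content

CSS selectors can be used to locate the tags that hold the content to scrape. Taking the title as an example: right-click the title on the page and choose Inspect to reveal its source, then right-click that source, choose Copy, and pick Copy selector. The copied selector is:

    #wrapper > h1 > span

2. Uses of the data

2.1 Basic analysis

- Descriptive statistics: compute the mean, median, maximum, minimum and standard deviation of numeric fields such as price and page count; count the books per binding type (binding) or per publisher (publisher).
- Frequency distributions: plot the distribution of publication years (publication_year) to see the yearly publishing trend; analyze the share of each star rating (stars5_starstop to stars1_starstop) to understand the overall rating distribution.
- Simple relationships: explore the correlation between price and rating; check whether page count and rating are visibly related.
- Grouped summaries: group books by author (author), publisher (publisher) or series (series) and compute per-group metrics such as average rating and total sales.

2.2 Advanced analysis

- Predictive modeling: use machine learning to predict a book's likely rating from factors such as author, publisher, price and publication year; build models that predict sales to help publishers or bookshops manage inventory.
- Cluster analysis: cluster books to find groups with similar characteristics, such as similar topics, readership or market performance; run topic modeling on the text behind the comment links to identify common reader concerns or types of feedback.
- Causal analysis: control for other variables to study the effect of specific factors (cover design, translation quality and so on) on ratings or sales; use experimental or quasi-experimental designs to evaluate the effect of marketing campaigns on sales.
- Time series analysis: with several consecutive years of data, analyze publication year and sales as time series to forecast trends; study how specific events (for example an author winning an award) affect sales over time.
- Network analysis: build author collaboration networks or book citation networks to explore collaboration patterns and the spread of influence in academic or literary fields.
- Sentiment analysis: run sentiment analysis on the content behind the comment links to understand how readers feel about a book.
- Multivariate regression: study how several variables (price, page count, publication year and so on) jointly affect a book's rating or sales.

3. Strategies for dealing with anti-scraping measures

3.1 Use a User-Agent to imitate real browser requests

Many websites inspect the User-Agent field of the HTTP request headers to decide whether a request comes from a real browser. By default, requests sent by Python libraries carry an obvious identifier and are easily recognized as automated. Setting the User-Agent to imitate different browsers and operating systems helps get past this check.

3.2 Apply random delays

Frequent requests at a regular interval are a typical crawler signature. Adding a random delay between requests imitates the access pattern of a human user, reduces the load on the server, and lowers the risk of being blocked.

3.3 Build and use a proxy pool

Sending a large number of requests from a single IP address easily leads to a ban. With a proxy pool you can rotate through different IP addresses and spread the risk. This makes the crawler less conspicuous and also reduces the pressure on any single IP address.

3.4 Other

- CAPTCHA handling: some websites use CAPTCHAs to verify the user. In that case, consider a third-party OCR service or a dedicated CAPTCHA recognition API.
- JavaScript-rendered pages: some modern websites load content dynamically with JavaScript, so plain HTML parsing may not see the full data. A tool such as Selenium can be used here; it launches a real browser instance and executes the JavaScript.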
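The proxy rotation described in 3.3 can be sketched with the standard library alone. This is only an illustration under stated assumptions: the addresses in PROXY_POOL are documentation-range placeholders rather than working proxies, and build_proxies and proxy_cycle are hypothetical helper names that do not appear in the original code. The dictionary they produce has the shape that the proxies argument of requests.get accepts.

```python
import itertools
import random

# Placeholder endpoints from a documentation address range (assumed, not real proxies).
PROXY_POOL = [
    "203.0.113.10:8080",
    "203.0.113.11:8080",
    "203.0.113.12:8080",
]


def build_proxies(proxy, username="", password=""):
    """Build a requests-style proxies mapping for one proxy endpoint."""
    auth = f"{username}:{password}@" if username else ""
    proxy_url = f"http://{auth}{proxy}/"
    return {"http": proxy_url, "https": proxy_url}


def proxy_cycle(pool, shuffle=True):
    """Yield proxies mappings forever, visiting every endpoint once per round."""
    endpoints = list(pool)
    if shuffle:
        random.shuffle(endpoints)
    for endpoint in itertools.cycle(endpoints):
        yield build_proxies(endpoint)


rotation = proxy_cycle(PROXY_POOL, shuffle=False)
first = next(rotation)
second = next(rotation)
print(first["http"])
print(second["http"])
```

Each call to next(rotation) could then be passed as proxies=... to a request, so that consecutive requests leave from different addresses.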
III. Writing the crawler code

1. Scraping the tag index HTML

Implementation:

    import random
    import time
    from pathlib import Path

    import requests


    def get_request(url, **kwargs):
        # Random delay before every request
        time.sleep(random.uniform(0.1, 2))
        print(f"URL: {url}")
        # A pool of User-Agent strings
        user_agents = [
            # Chrome
            "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/117.0.0.0 Safari/537.36",
            "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/117.0.0.0 Safari/537.36",
            "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/117.0.0.0 Safari/537.36",
            # Firefox
            "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:109.0) Gecko/20100101 Firefox/117.0",
            "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:109.0) Gecko/20100101 Firefox/117.0",
            "Mozilla/5.0 (X11; Linux i686; rv:109.0) Gecko/20100101 Firefox/117.0",
            # Edge
            "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/117.0.0.0 Safari/537.36 Edg/117.0.2040.0",
            # Safari
            "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/16.5 Safari/605.1.15",
        ]
        # Request headers
        headers = {"User-Agent": random.choice(user_agents)}
        # Username/password authentication (private or dedicated proxy)
        username = ""
        password = ""
        proxy_url = "http://%(user)s:%(pwd)s@%(proxy)s/" % {
            "user": username, "pwd": password, "proxy": "36.25.243.5:11768"}
        proxies = {"http": proxy_url, "https": proxy_url}
        max_retries = 3
        for attempt in range(max_retries):
            try:
                response = requests.get(url=url, timeout=10, headers=headers, **kwargs)
                # response = requests.get(url=url, timeout=10, headers=headers, proxies=proxies, **kwargs)
                if response.status_code == 200:
                    return response
                print(f"Request failed, status code: {response.status_code}; "
                      f"retrying (attempt {attempt + 1}/{max_retries})")
            except requests.exceptions.RequestException as e:
                print(f"Exception during request: {e}; "
                      f"retrying (attempt {attempt + 1}/{max_retries})")
            # Unless this was the last attempt, wait a moment before retrying
            if attempt < max_retries - 1:
                time.sleep(random.uniform(1, 2))
        print("All attempts failed; please check what went wrong")
        return None  # or return the last response, depending on your needs


    def save_book_html_file(save_dir, file_name, content):
        dir_path = Path(save_dir)
        # Make sure the target directory exists, creating parent directories as needed
        dir_path.mkdir(parents=True, exist_ok=True)
        # The with statement guarantees the file is closed properly
        with open(save_dir + file_name, "w", encoding="utf-8") as fp:
            fp.write(str(content))
        print(f"{save_dir + file_name} saved")


    def download_book_tag():
        save_dir = "../douban/douban_book/douban_book_tag/"
        file_name = "douban_book_all_tag.html"
        book_tag_url = "https://book.douban.com/tag/?view=type&icn=index-sorttags-all"
        tag_file_path = Path(save_dir + file_name)
        if tag_file_path.exists() and tag_file_path.is_file():
            print(f"\nFile {tag_file_path} already exists")
        else:
            print(f"File {tag_file_path} does not exist, downloading...")
            response = get_request(book_tag_url)
            # get_request returns None when every attempt failed
            if response is None:
                return
            save_book_html_file(save_dir=save_dir, file_name=file_name, content=response.text)


    if __name__ == "__main__":
        download_book_tag()

The code can be run repeatedly. On later runs it detects that the file has already been downloaded and skips the download.

2. Scraping all pages of a single tag

This step builds on the tag index code above. After parsing the tag index HTML with BeautifulSoup, it loops over the tag names and links and downloads every listing page of every tag.

The script reuses get_request, save_book_html_file and download_book_tag from section 1 unchanged, so only the additions are listed here; the script also needs `from bs4 import BeautifulSoup` among its imports. The proxy trial used by the author is at https://www.kuaidaili.com/freetest/; the proxy username and password stay empty unless you supply your own.

    def get_soup(markup):
        return BeautifulSoup(markup=markup, features="lxml")


    def get_book_type_and_href():
        # Path of the saved tag index HTML file
        file = "../douban/douban_book/douban_book_tag/douban_book_all_tag.html"
        # Maps tag name -> tag link
        name_href_result = {}
        # Base URL of Douban Books, used to build absolute links
        book_base_url = "https://book.douban.com"
        # Open and read the HTML file
        with open(file=file, mode="r", encoding="utf-8") as fp:
            # Parse the HTML with BeautifulSoup
            soup = get_soup(fp)
        # Main container that holds all tag links
        tag = soup.select_one("#content > div > div.article > div:nth-child(2)")
        # One tag table per category
        tables = tag.select("div > a.tag-title-wrapper + table.tagCol")
        for table in tables:
            # Every row (tr) of the table
            for tr_tag in table.select("tr"):
                # Every cell (td) of the row
                for td_tag in tr_tag.select("td"):
                    # First <a> in the cell, if there is one
                    a_tag = td_tag.select_one("a")
                    if a_tag:
                        # Text of the <a> is the tag name
                        tag_text = a_tag.string
                        # Join the href with the base URL to get the full link
                        tag_href = book_base_url + a_tag.attrs.get("href")
                        name_href_result[tag_text] = tag_href
        # Dictionary of all tag names and their links
        return name_href_result


    def get_book_data_dagai(name, start):
        book_tag_base_url = "https://book.douban.com/tag/" + name
        payload = {
            "start": start,
            "type": "T",
        }
        response = get_request(book_tag_base_url, params=payload)
        if response is None:
            return None
        return response.text


    def download_book_data_dagai(name, start):
        # Returns True once the page after the last one is reached, otherwise None
        save_dir = "../douban/douban_book/douban_book_data_dagai/"
        file_name = f"douban_book_data_dagai_{name}_{start}.html"
        dagai_file_path = Path(save_dir + file_name)
        if dagai_file_path.exists() and dagai_file_path.is_file():
            print(f"File {dagai_file_path} already exists")
        else:
            print(f"File {dagai_file_path} does not exist, downloading...")
            content = get_book_data_dagai(name, start)
            if content is None:
                return None
            # Check whether we are past the last page: an empty listing
            # contains a <p> directly under #subject_list
            soup = get_soup(content)
            p_tag = soup.select_one("#subject_list > p")
            if p_tag is not None:
                print(f"Finished scraping the pages of tag {name}")
                return True
            save_book_html_file(save_dir=save_dir, file_name=file_name, content=content)


    if __name__ == "__main__":
        download_book_tag()
        book_type = get_book_type_and_href()
        book_type_name = book_type.keys()
        print(book_type_name)
        for type_name in book_type_name:
            print(f"Book tag: {type_name}")
            start_ = 0
            while True:
                flag = download_book_data_dagai(type_name, start_)
                start_ = start_ + 20
                # None means the page was saved, skipped or failed: move on to the next page
                if flag is None:
                    continue
                if flag:
                    print(f"Listing HTML for tag {type_name} downloaded")
                    break

While running, the script prints the address of each request and the path of each saved file, and the listing pages are saved under ../douban/douban_book/douban_book_data_dagai/.
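The paging rule used by get_book_data_dagai (20 items per page, start as a multiple of 20, type=T) can be made explicit with a small standard-library sketch. The helper name tag_page_url and the constants are assumptions introduced for illustration; requests performs the same encoding automatically when the parameters are passed through params.

```python
from urllib.parse import quote, urlencode

BASE_URL = "https://book.douban.com/tag/"
PAGE_SIZE = 20  # items per listing page, as observed in section 1.2.2


def tag_page_url(tag_name, page_index):
    """Return the URL of the page_index-th (0-based) listing page of a tag."""
    query = urlencode({"start": page_index * PAGE_SIZE, "type": "T"})
    # quote() percent-encodes the non-ASCII tag name
    return f"{BASE_URL}{quote(tag_name)}?{query}"


# Print the first three page URLs of the 小说 tag.
for page in range(3):
    print(tag_page_url("小说", page))
```

Printing the URLs up front like this is a cheap way to check the offsets before sending any request.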
3. Scraping the HTML of individual books

This step builds on the code for scraping all pages of a single tag. After parsing each listing page with BeautifulSoup, it follows the link of every book and saves the book's HTML page.

The script keeps every function from sections 1 and 2 unchanged (get_request, save_book_html_file, download_book_tag, get_soup, get_book_type_and_href, get_book_data_dagai and download_book_data_dagai), so only the additions and the updated entry point are listed here.

    def download_book_data_detail():
        save_dir = "../douban/douban_book/douban_book_data_detail/"
        dagai_dir = Path("../douban/douban_book/douban_book_data_dagai/")
        dagai_file_list = dagai_dir.rglob("*.html")
        for dagai_file in dagai_file_list:
            # Parse one saved listing page
            with open(file=dagai_file, mode="r", encoding="utf-8") as fp:
                soup = get_soup(markup=fp)
            a_tag_list = soup.select("#subject_list ul li h2 a")
            for a_tag in a_tag_list:
                href = a_tag.attrs.get("href")
                # The book id is the last path segment before the trailing slash
                book_id = href.split("/")[-2]
                file_name = f"douban_book_data_detail_{book_id}.html"
                detail_file_path = Path(save_dir + file_name)
                if detail_file_path.exists() and detail_file_path.is_file():
                    print(f"File {detail_file_path} already exists")
                else:
                    print(f"File {detail_file_path} does not exist, downloading...")
                    response = get_request(href)
                    if response is None:
                        continue
                    save_book_html_file(save_dir, file_name, response.text)


    def print_in_rows(items, items_per_row=20):
        # Print the items separated by spaces, items_per_row to a line
        for index, name in enumerate(items, start=1):
            print(f"{name}", end=" ")
            if index % items_per_row == 0:
                print()


    if __name__ == "__main__":
        download_book_tag()
        book_type = get_book_type_and_href()
        book_type_name = book_type.keys()
        print(book_type_name)
        for type_name in book_type_name:
            print(f"Book tag: {type_name}")
            start_ = 0
            while True:
                flag = download_book_data_dagai(type_name, start_)
                start_ = start_ + 20
                if flag is None:
                    continue
                if flag:
                    print(f"Listing HTML for tag {type_name} downloaded")
                    break
        download_book_data_detail()

While running, the script prints the address of each request and the path of each saved file, and the book pages are saved under ../douban/douban_book/douban_book_data_detail/.
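The book id is taken from the link with href.split("/")[-2], which relies on the link ending in a slash. A slightly more defensive variant is sketched below using only the standard library; the helper name extract_book_id is hypothetical, the sample ids are made up, and the sketch assumes book links have the form /subject/<digits>/ as in the comment link used later in the article.

```python
import re
from urllib.parse import urlparse

# Assumed link shape: https://book.douban.com/subject/<digits>/ (trailing slash optional)
SUBJECT_PATH = re.compile(r"^/subject/(\d+)/?$")


def extract_book_id(href):
    """Return the numeric book id from a book link, or None if the link does not match."""
    path = urlparse(href).path
    match = SUBJECT_PATH.match(path)
    return match.group(1) if match else None


# Sample links with made-up ids.
print(extract_book_id("https://book.douban.com/subject/1234567/"))
print(extract_book_id("https://book.douban.com/subject/1234567"))
print(extract_book_id("https://book.douban.com/tag/小说"))
```

Returning None for an unexpected link lets the caller skip it instead of saving a file under a wrong name.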
IV. Data processing and storage

1. Parsing the HTML and saving the data to a CSV file

BeautifulSoup parses the information of a single book out of each HTML document. The books are parsed in a loop and the data is saved to a CSV file.

1.1 Field descriptions

    Field               Description
    book_id             Unique identifier of the book.
    title               Book title.
    img_src             URL of the cover image.
    author              Author name.
    publisher           Publisher name.
    producer            Producer or imprint, if any.
    original_title      Original title (for translated works, the title in the original language).
    translator          Translator name, if any.
    publication_year    Publication year.
    page_count          Number of pages.
    price               List price.
    binding             Binding type, such as paperback or hardcover.
    series              Name of the book series, if any.
    isbn                International Standard Book Number.
    rating              Average rating.
    rating_sum          Number of people who rated the book.
    comment_link        Link to the user comments.
    stars5_starstop     Share of five-star ratings.
    stars4_starstop     Share of four-star ratings.
    stars3_starstop     Share of three-star ratings.
    stars2_starstop     Share of two-star ratings.
    stars1_starstop     Share of one-star ratings.

1.2 Implementation

The parsed data is written to the CSV file after every 100 books, and any remaining rows are written once the loop ends.

    from pathlib import Path

    import pandas as pd
    from bs4 import BeautifulSoup

    BATCH_SIZE = 100


    def get_soup(markup):
        return BeautifulSoup(markup=markup, features="lxml")


    def find_label(tag_info, label):
        # Locate the <span class="pl"> whose text is exactly the given label
        return tag_info.find(name="span", attrs={"class": "pl"}, string=label)


    def get_linked_value(tag_info, label):
        # Value wrapped in its own element (for example an <a>) after the label
        tag = find_label(tag_info, label)
        if tag is None:
            return ""
        return tag.next_sibling.next_sibling.text.strip()


    def get_plain_value(tag_info, label):
        # Value stored as bare text right after the label
        tag = find_label(tag_info, label)
        if tag is None:
            return ""
        return tag.next_sibling.strip()


    def get_star_ratio(rating_wrap, stars):
        tag = rating_wrap.select_one(f"#interest_sectl > div > span.stars{stars}.starstop")
        if tag is None:
            return ""
        # The percentage sits in the <span class="rating_per"> four siblings further on
        return tag.next_sibling.next_sibling.next_sibling.next_sibling.text.strip()


    def append_rows_to_csv(rows, csv_file_path):
        df = pd.DataFrame(rows)
        if not csv_file_path.exists():
            df.to_csv(csv_file_path, index=False, encoding="utf-8-sig")
        else:
            df.to_csv(csv_file_path, index=False, encoding="utf-8-sig", mode="a", header=False)


    def parse_detail_html_to_csv():
        # Path of the CSV file
        csv_file_dir = "../douban/douban_book/data_csv/"
        csv_file_name = "douban_books.csv"
        csv_file_path = Path(csv_file_dir + csv_file_name)
        Path(csv_file_dir).mkdir(parents=True, exist_ok=True)
        detail_dir = Path("../douban/douban_book/douban_book_data_detail/")
        detail_file_list = detail_dir.rglob("*.html")
        book_data = []
        for detail_file in detail_file_list:
            # File names look like douban_book_data_detail_<book_id>.html
            book_id = detail_file.stem.split("_")[-1]
            with open(file=detail_file, mode="r", encoding="utf-8") as fp:
                soup = get_soup(fp)
            title = soup.select_one("#wrapper > h1 > span").string
            tag_subjectwrap = soup.select_one(
                "#content > div > div.article > div.indent > div.subjectwrap.clearfix")
            img_src = tag_subjectwrap.select_one("#mainpic > a > img").attrs.get("src")
            tag_info = tag_subjectwrap.select_one("div.subject.clearfix > #info")
            # The label strings must match the page text exactly, so they stay in Chinese
            author = get_linked_value(tag_info, " 作者")
            publisher = get_linked_value(tag_info, "出版社:")
            producer = get_linked_value(tag_info, "出品方:")
            original_title = get_plain_value(tag_info, "原作名:")
            translator = get_linked_value(tag_info, " 译者")
            publication_year = get_plain_value(tag_info, "出版年:")
            page_count = get_plain_value(tag_info, "页数:")
            price = get_plain_value(tag_info, "定价:")
            binding = get_plain_value(tag_info, "装帧:")
            series = get_linked_value(tag_info, "丛书:")
            isbn = get_plain_value(tag_info, "ISBN:")
            # Rating block
            tag_rating_wrap_clearbox = tag_subjectwrap.select_one("#interest_sectl > div")
            # Average rating
            tag_rating = tag_rating_wrap_clearbox.select_one(
                "#interest_sectl > div > div.rating_self.clearfix > strong")
            rating = "" if tag_rating is None else tag_rating.string.strip()
            # Number of ratings
            tag_rating_sum = tag_rating_wrap_clearbox.select_one(
                "#interest_sectl > div > div.rating_self.clearfix > div > div.rating_sum > span > a > span")
            rating_sum = "" if tag_rating_sum is None else tag_rating_sum.string.strip()
            # Comment link
            comment_link = f"https://book.douban.com/subject/{book_id}/comments/"
            data_dict = {
                "book_id": book_id,
                "title": title,
                "img_src": img_src,
                "author": author,
                "publisher": publisher,
                "producer": producer,
                "original_title": original_title,
                "translator": translator,
                "publication_year": publication_year,
                "page_count": page_count,
                "price": price,
                "binding": binding,
                "series": series,
                "isbn": isbn,
                "rating": rating,
                "rating_sum": rating_sum,
                "comment_link": comment_link,
                # Share of five-star down to one-star ratings
                "stars5_starstop": get_star_ratio(tag_rating_wrap_clearbox, 5),
                "stars4_starstop": get_star_ratio(tag_rating_wrap_clearbox, 4),
                "stars3_starstop": get_star_ratio(tag_rating_wrap_clearbox, 3),
                "stars2_starstop": get_star_ratio(tag_rating_wrap_clearbox, 2),
                "stars1_starstop": get_star_ratio(tag_rating_wrap_clearbox, 1),
            }
            print(f"File: {detail_file}, parsed data:")
            print(data_dict)
            print()
            # Collect the row and write a batch every BATCH_SIZE books
            book_data.append(data_dict)
            if len(book_data) == BATCH_SIZE:
                append_rows_to_csv(book_data, csv_file_path)
                book_data = []
        # Write the final, partially filled batch
        if book_data:
            append_rows_to_csv(book_data, csv_file_path)


    if __name__ == "__main__":
        parse_detail_html_to_csv()

While running, the script prints each file path together with the parsed data. The CSV file is written to ../douban/douban_book/data_csv/douban_books.csv.
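The batching idea above (buffer rows, write every 100, then write whatever is left) can be shown in isolation with the standard csv module instead of pandas. This is a sketch under assumptions: write_in_batches is a hypothetical helper, the rows are synthetic, and the file lives in a temporary directory so that nothing from the real project is touched.

```python
import csv
import tempfile
from pathlib import Path


def write_in_batches(rows, csv_path, fieldnames, batch_size=100):
    """Append rows to csv_path in batches and return the number of rows written."""
    csv_path = Path(csv_path)
    buffer = []
    written = 0

    def flush():
        nonlocal written
        if not buffer:
            return
        # Only a file that does not exist yet gets a header row
        is_new = not csv_path.exists()
        with csv_path.open("a", newline="", encoding="utf-8") as fp:
            writer = csv.DictWriter(fp, fieldnames=fieldnames)
            if is_new:
                writer.writeheader()
            writer.writerows(buffer)
        written += len(buffer)
        buffer.clear()

    for row in rows:
        buffer.append(row)
        if len(buffer) >= batch_size:
            flush()
    flush()  # write the final, partially filled batch
    return written


with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "books.csv"
    sample = ({"book_id": i, "title": f"book {i}"} for i in range(250))
    total = write_in_batches(sample, target, ["book_id", "title"])
    with target.open(newline="", encoding="utf-8") as fp:
        saved = list(csv.DictReader(fp))

print(total, len(saved))
```

With 250 synthetic rows and a batch size of 100, the last 50 rows only reach the file because of the final flush call, which is the step that is easy to forget.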
2. Data cleaning and storage

2.1 Data cleaning

The data is cleaned with pandas.

- Empty values: unless stated otherwise below, empty values are filled with 未知 ("unknown").
- Date: empty values are filled with 1970-01-01; a missing month or day is filled with 01.
- Page count: empty values are filled with 0.
- Price: empty values are filled with 0.
- Rating: empty values are filled with 0.
- Number of ratings: empty values are filled with 0.
- Star ratings: empty values are filled with 0.

2.2 Data storage

The cleaned data is saved to MySQL.

2.2.1 Table design

The MySQL table structure follows the fields listed above.

    Field               Type             Description
    book_id             INT              Unique identifier of the book.
    title               VARCHAR(255)     Book title.
    img_src             VARCHAR(255)     URL of the cover image.
    author              VARCHAR(255)     Author name.
    publisher           VARCHAR(255)     Publisher name.
    producer            VARCHAR(255)     Producer or imprint, if any.
    original_title      VARCHAR(255)     Original title (for translated works, the title in the original language).
    translator          VARCHAR(255)     Translator name, if any.
    publication_year    DATE             Publication year.
    page_count          INT              Number of pages.
    price               DECIMAL(10, 2)   List price.
    binding             VARCHAR(255)     Binding type, such as paperback or hardcover.
    series              VARCHAR(255)     Name of the book series, if any.
    isbn                VARCHAR(20)      International Standard Book Number.
    rating              DECIMAL(3, 1)    Average rating.
    rating_sum          INT              Number of people who rated the book.
    comment_link        VARCHAR(255)     Link to the user comments.
    stars5_starstop     DECIMAL(5, 2)    Share of five-star ratings.
    stars4_starstop     DECIMAL(5, 2)    Share of four-star ratings.
    stars3_starstop     DECIMAL(5, 2)    Share of three-star ratings.
    stars2_starstop     DECIMAL(5, 2)    Share of two-star ratings.
    stars1_starstop     DECIMAL(5, 2)    Share of one-star ratings.

2.2.2 Table creation

Create the database douban.

    create database douban;

Switch to the database douban.

    use douban;

Create the table cleaned_douban_books, which stores the cleaned data.

    CREATE TABLE cleaned_douban_books (
        book_id INT PRIMARY KEY,
        title VARCHAR(255),
        img_src VARCHAR(255),
        author VARCHAR(255),
        publisher VARCHAR(255),
        producer VARCHAR(255),
        original_title VARCHAR(255),
        translator VARCHAR(255),
        publication_year DATE,
        page_count INT,
        price DECIMAL(10, 2),
        binding VARCHAR(255),
        series VARCHAR(255),
        isbn VARCHAR(20),
        rating DECIMAL(3, 1),
        rating_sum INT,
        comment_link VARCHAR(255),
        stars5_starstop DECIMAL(5, 2),
        stars4_starstop DECIMAL(5, 2),
        stars3_starstop DECIMAL(5, 2),
        stars2_starstop DECIMAL(5, 2),
        stars1_starstop DECIMAL(5, 2)
    );

2.3 Implementation

    import re

    import pandas as pd
    from sqlalchemy import create_engine

    DEFAULT_DATE = "1970-01-01"


    def read_csv_to_df(file_path):
        # Load the CSV file into a DataFrame; utf-8-sig drops the BOM written
        # by the parsing step, and ISBNs are kept as text
        return pd.read_csv(file_path, encoding="utf-8-sig", dtype={"isbn": str})


    def unify_date_format(date_str):
        # Empty values get the default date
        if date_str is None or pd.isna(date_str):
            return DEFAULT_DATE
        processed_date = date_str
        # Convert Chinese-style dates such as 2024年11月3日 to 2024-11-3
        if isinstance(date_str, str) and "年" in date_str and "月" in date_str:
            processed_date = (date_str.replace("年", "-").replace("月", "-")
                              .replace("日", "").strip("- "))
        try:
            # pd.to_datetime fills a missing month or day with 01
            date_obj = pd.to_datetime(processed_date, errors="coerce")
            if pd.isna(date_obj):
                return DEFAULT_DATE
            # Year only: make the default month and day explicit
            if len(str(processed_date).split("-")) == 1:
                date_obj = date_obj.replace(month=1, day=1)
            # Return the normalized date string
            return date_obj.strftime("%Y-%m-%d")
        except Exception as e:
            print(f"Error parsing date {date_str}: {e}")
            return DEFAULT_DATE


    def clean_price(price_str):
        if pd.isna(price_str):
            return 0
        # Values that pandas already parsed as numbers are kept as they are
        if isinstance(price_str, (int, float)):
            return round(float(price_str), 2)
        if not isinstance(price_str, str):
            return 0
        # Keep only digits, decimal points and "/" (separator between several prices)
        cleaned = re.sub(r"[^\d./]", "", price_str)
        prices = []
        for part in cleaned.split("/"):
            # Take the first number found in each part
            sub_parts = re.findall(r"\d+\.\d+|\d+", part)
            if sub_parts:
                try:
                    prices.append(float(sub_parts[0]))
                except ValueError:
                    continue
        if not prices:
            return 0
        # Several prices: use their average as the representative value
        avg_price = sum(prices) / len(prices)
        # Keep two decimal places
        return round(avg_price, 2)


    def clean_percentage(percentage_str):
        if pd.isna(percentage_str) or not isinstance(percentage_str, str):
            return 0
        # Strip the percent sign and convert to float
        cleaned = re.sub(r"[^\d.]", "", percentage_str)
        if not cleaned:
            return 0
        return round(float(cleaned), 2)


    def clean_page_count(page_str):
        if pd.isna(page_str):
            return 0
        # Values that pandas already parsed as numbers are kept as they are
        if isinstance(page_str, (int, float)):
            return int(page_str)
        if not isinstance(page_str, str) or not page_str.strip():
            return 0
        # Keep only digits and semicolons (separator between several page counts)
        cleaned = re.sub(r"[^\d;]", "", page_str)
        pages = [int(p) for p in cleaned.split(";") if p]
        if not pages:
            return 0
        # Several page counts: use the largest one
        return max(pages)


    # Clean and convert the data
    def clean_and_transform(df):
        # Drop rows with a duplicate book_id
        df = df.drop_duplicates(subset=["book_id"]).copy()
        # Text columns: fill empty values with 未知 ("unknown")
        text_columns = ["author", "publisher", "producer", "original_title",
                        "translator", "binding", "series", "isbn"]
        df[text_columns] = df[text_columns].fillna("未知")
        # Dates: empty values become 1970-01-01, a missing month or day becomes 01
        df["publication_year"] = df["publication_year"].apply(unify_date_format)
        df["page_count"] = df["page_count"].apply(clean_page_count).astype(int)
        df["price"] = df["price"].apply(clean_price)
        df["rating"] = df["rating"].fillna(0)
        df["rating_sum"] = df["rating_sum"].fillna(0).astype(int)
        for stars in range(5, 0, -1):
            column = f"stars{stars}_starstop"
            df[column] = df[column].apply(clean_percentage)
        return df


    def save_df_to_db(df):
        # Database connection settings (replace with your own)
        db_user = "root"
        db_password = "zxcvbq"
        db_host = "127.0.0.1"  # or the address of your database host
        db_port = "3306"  # the default MySQL port is 3306
        db_name = "douban"
        # Create the database engine
        engine = create_engine(
            f"mysql+mysqlconnector://{db_user}:{db_password}@{db_host}:{db_port}/{db_name}")
        # Write the DataFrame to the MySQL table
        df.to_sql(name="cleaned_douban_books", con=engine, if_exists="append", index=False)
        print("All CSV data has been cleaned and written to the MySQL database")


    if __name__ == "__main__":
        csv_file = "../douban/douban_book/data_csv/douban_books.csv"
        df = read_csv_to_df(csv_file)
        df = clean_and_transform(df)
        save_df_to_db(df)

Query the book data in the cleaned_douban_books table:

    select * from cleaned_douban_books limit 10;
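The date rule from section 2.1 (empty dates become 1970-01-01, a missing month or day becomes 01) can also be written without pandas, which makes it easy to check on a handful of inputs. This is a sketch, not the article's implementation: normalize_date and DATE_PARTS are hypothetical names, and the sketch assumes the year always comes first and is written with four digits.

```python
import re
from datetime import date

DEFAULT_DATE = "1970-01-01"
# Four-digit year, then optional month and day separated by any non-digit characters.
DATE_PARTS = re.compile(r"(\d{4})(?:\D+(\d{1,2}))?(?:\D+(\d{1,2}))?")


def normalize_date(raw):
    """Return raw as YYYY-MM-DD, filling a missing month or day with 01."""
    if raw is None:
        return DEFAULT_DATE
    match = DATE_PARTS.search(str(raw))
    if match is None:
        return DEFAULT_DATE
    year, month, day = match.groups()
    try:
        # date() rejects impossible values such as month 13 or February 31
        return date(int(year), int(month or 1), int(day or 1)).isoformat()
    except ValueError:
        return DEFAULT_DATE


for sample in ["2024-11", "2024年11月3日", "2024", "", None]:
    print(repr(sample), "->", normalize_date(sample))
```

Because the separators are matched as "any non-digit characters", the same function covers 2024-11, 2024.11 and 2024年11月 without a separate replacement step.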
