Preface
- I suddenly felt like scraping a CSDN column's articles and keeping a local copy; to keep the impact on the site small, I picked the target while browsing CSDN's homepage.
- How did I land on the "综合资讯" (General News) section as a test target? By clicking one of the trending articles shown in the figure below, jumping to the article itself, and then clicking through to its column.
Enough talk, straight to the code:
- The code is written in an object-oriented style
- The reasoning behind it is laid out in the comments inside the code
- It relies on mainstream libraries such as requests and bs4
- Beyond that, I hope you will keep building on and polishing it when you have time, hahaha!!!
- The code and its output have been packaged and uploaded to Gitee, where you can take a look
"""
@Author:survive
@Blog(集体博客地址): https://blog.csdn.net/haojie_duan
@File:csdn.py.py
@Time:2022/2/10 8:49
@Motto: 我不晓得将去何方,但我已在路上。——宫崎骏《千与千寻》代码思路:1. 确定指标需要: 将 csdn 文章内容保留成 html、PDF、md 格局
- 1.1 首先保留为 html 格局:获取列表页中所有的文章 ur1 地址,申请文章 ur1 地址获取咱们须要的文章内容
- 1.2 通过 wkhtmitopdf.exe 把 html 文件转换成 PDF 文件
- 1.3 通过 wkhtmitopdf.exe 把 html 文件转换成 md 文件
2. 申请 ur1 获取网页源代码
3. 解析数据,提取本人想要内容
4. 保留数据
5. 转换数据类型把 HTML 转换成 PDF、md 文伴
"""html_str ="""
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<title>Document</title>
</head>
<body>
{article}
</body>
</html>
"""
import requests
import parsel
import pdfkit  # converts HTML to PDF
import re
import os
import urllib.parse
from bs4 import BeautifulSoup
import html2text  # converts HTML to Markdown
import random
# user_agent pool: pick one at random on every run, to avoid being blocked for overly frequent access
USER_AGENT_LIST = ["Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1",
"Mozilla/5.0 (X11; CrOS i686 2268.111.0) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.57 Safari/536.11",
"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1092.0 Safari/536.6",
"Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1090.0 Safari/536.6",
"Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; 360SE)",
"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
"Mozilla/5.0 (Windows NT 6.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
"Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.0 Safari/536.3",
"Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24",
"Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24"
"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.82 Safari/537.36"
]
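One detail worth noting: the class below picks a user agent once in `__init__` and then reuses it for every request. If you want a fresh pick per request, move the choice into the request helper instead. A minimal sketch (the two shortened pool entries are placeholders, not real browser strings):

```python
import random

# Illustrative stand-ins for the USER_AGENT_LIST above
UA_POOL = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) Example/1.0",
    "Mozilla/5.0 (X11; Linux x86_64) Example/2.0",
]

def build_headers(pool):
    # A fresh random pick on every call, not once per session
    return {"user-agent": random.choice(pool)}

headers = build_headers(UA_POOL)
```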
class CSDNSpider():
    def __init__(self):
        self.url = 'https://blog.csdn.net/csdndevelopers/category_10594816.html'
        self.headers = {'user-agent': random.choice(USER_AGENT_LIST)}

    def send_request(self, url):
        response = requests.get(url=url, headers=self.headers)
        response.encoding = "utf-8"
        if response.status_code == 200:
            return response

    def parse_content(self, response):
        html = response.text
        selector = parsel.Selector(html)
        href = selector.css('.column_article_list a::attr(href)').getall()
        name = 0
        for link in href:
            print(link)
            name = name + 1
            # Send a request to each article URL
            response = self.send_request(link)
            if response:
                self.parse_detail(response, name)

    def parse_detail(self, response, name):
        html = response.text
        selector = parsel.Selector(html)
        title = selector.css('#articleContentId::text').get()
        # content = selector.css('#content_views').get()
        # The fragment extracted with parsel redirects to the CSDN homepage
        # when opened locally and cannot be converted to PDF, so BeautifulSoup
        # is used here instead
        soup = BeautifulSoup(html, 'lxml')
        # Matching by id alone covers both article layouts
        # ("markdown_views prism-atom-one-light" and "htmledit_views")
        content = soup.find('div', id="content_views")
        html = html_str.format(article=content)
        self.write_content(html, title)

    def write_content(self, content, name):
        # Make sure the output directories exist
        for folder in ("HTML", "PDF", "MD"):
            os.makedirs(folder, exist_ok=True)
        html_path = "HTML/" + str(self.change_title(name)) + ".html"
        pdf_path = "PDF/" + str(self.change_title(name)) + ".pdf"
        md_path = "MD/" + str(self.change_title(name)) + ".md"
        # Save the content as an HTML file
        with open(html_path, 'w', encoding="utf-8") as f:
            f.write(content)
        print("Saving", name, ".html")
        # Convert the HTML file to PDF
        config = pdfkit.configuration(wkhtmltopdf=r'G:\Dev\wkhtmltopdf\bin\wkhtmltopdf.exe')
        pdfkit.from_file(html_path, pdf_path, configuration=config)
        print("Saving", name, ".pdf")
        # os.remove(html_path)
        # Convert the HTML file to Markdown
        html_text = open(html_path, 'r', encoding='utf-8').read()
        markdown = html2text.html2text(html_text)
        with open(md_path, 'w', encoding='utf-8') as file:
            file.write(markdown)
        print("Saving", name, ".md")

    def change_title(self, title):
        # Replace characters that are not allowed in Windows file names
        mode = re.compile(r'[\\\/\:\?\*\"\<\>\|\!]')
        new_title = re.sub(mode, '_', title)
        return new_title

    def start(self):
        response = self.send_request(self.url)
        if response:
            self.parse_content(response)

if __name__ == '__main__':
    csdn = CSDNSpider()
    csdn.start()
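The `change_title` helper exists because article titles become file names, and Windows rejects `\ / : ? * " < > |` in file names (`!` is also replaced here, although Windows actually allows it). A standalone check of the same substitution:

```python
import re

# Same character class as change_title: anything Windows rejects in a
# file name (plus "!") becomes an underscore
ILLEGAL = re.compile(r'[\\/:?*"<>|!]')

def safe_filename(title):
    return ILLEGAL.sub('_', title)

print(safe_filename('C++: what is "a?b"'))  # C++_ what is _a_b_
```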
The result looks like this:
- From left to right: the HTML, PDF and Markdown versions
- Opening a random article to check, again from left to right: the HTML, PDF and Markdown versions
Afterword
Reference articles:
- Python crawler: requests + BeautifulSoup4 to scrape a CSDN blog homepage (author info, article titles, article links) and per-article stats (views, bookmarks). Legitimately boosting view counts?
- A roundup of the Python modules for converting between Markdown and HTML