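I. Project Overview, Development Tools, and Environment Setup
1.1 Project overview
The cnblogs crawler scrapes the news pages of cnblogs (博客园) and stores the data in a database. The steps are as follows:
1. After opening the news page, scrape each list-page entry's title (text and link), image (the image and its URL), and tags.
2. Following a depth-first traversal, follow each title link from the list page to the detail page and scrape the title, body, publish time, and category tags (all of these come from the static page), plus the view count, comment count, and like count, also called the recommendation count (these are loaded dynamically).
3. Design the table structure, that is, define the fields and their types. Then write the insert function and store the data.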
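1.2 Development tools
PyCharm 2019.3
Navicat for MySQL 11.1.13
1.3 Environment setup
Open a command line, cd to the directory where you want to keep the virtual environment (virtualenv), and run pip install virtualenv. The mkvirtualenv and workon commands used below are provided by virtualenvwrapper (virtualenvwrapper-win on Windows), so install it as well with pip install virtualenvwrapper-win
The virtual environment tooling is now installed. Next, create the new project's virtual environment with a specific Python 3.6 interpreter.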
mkvirtualenv -p D:\Python36-64_install_location\python.exe article_spider
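Here, D:\Python36-64_install_location\ is the Python 3.6 install path and article_spider is the name of the new project's virtual environment.
To enter the virtual environment (article_spider), run workon article_spider
Inside the virtual environment, some Python packages time out or download very slowly, so we use the Douban PyPI mirror. Install the Scrapy crawling framework: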
pip install -i https://pypi.douban.com/simple/ scrapy
On some Windows systems the installation may still fail. In that case, go to:
https://www.lfd.uci.edu/~gohl…
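This site hosts prebuilt versions of the packages that most often fail to install on Windows. Press Ctrl+F to find the package you need and download it. Then open a command line, cd to the download directory, and run
pip install -i https://pypi.douban.com/simple <downloaded file name, including the extension>
These two pip install methods cover almost every Python package you will need.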
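II. Database Design
The table contains these fields:
title, URL, URL ID, cached image path, image URL, like count, comment count, view count, tags, content, publish date
The field types are as follows: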
No. | Field | Data type | Primary key | Description |
---|---|---|---|---|
1 | Title | varchar(255) | No | Title |
2 | Url | varchar(500) | No | URL |
3 | Url_object_id | varchar(50) | Yes | ID of the URL (its MD5 hash) |
4 | Front_image_path | varchar(200) | No | Cached image path |
5 | Front_image_url | varchar(500) | No | Image URL |
6 | Praise_nums | int(11) | No | Like count |
7 | Comment_nums | int(11) | No | Comment count |
8 | Fav_nums | int(11) | No | View count |
9 | Tags | varchar(255) | No | Tags |
10 | Content | longtext | No | Content |
11 | Create_date | datetime | No | Publish date |
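The field table can be turned into a CREATE TABLE statement. The following is a minimal sketch, not the author's actual DDL: the table name cnblogs_article and the utf8 charset are taken from the pipeline code later in this post, the lowercase column names follow its INSERT statement, and the NULL/NOT NULL choices are assumptions.

```python
# Column definitions copied from the field table: (name, SQL type, is_primary_key).
COLUMNS = [
    ("title", "varchar(255)", False),
    ("url", "varchar(500)", False),
    ("url_object_id", "varchar(50)", True),
    ("front_image_path", "varchar(200)", False),
    ("front_image_url", "varchar(500)", False),
    ("praise_nums", "int(11)", False),
    ("comment_nums", "int(11)", False),
    ("fav_nums", "int(11)", False),
    ("tags", "varchar(255)", False),
    ("content", "longtext", False),
    ("create_date", "datetime", False),
]


def build_create_table(table_name, columns):
    """Return a MySQL CREATE TABLE statement for the given column list."""
    lines = []
    for name, sql_type, is_pk in columns:
        # Assumption: only the primary key is declared NOT NULL.
        null_clause = "NOT NULL" if is_pk else "NULL"
        lines.append("  `{}` {} {}".format(name, sql_type, null_clause))
    pk_names = ", ".join("`{}`".format(n) for n, _, pk in columns if pk)
    lines.append("  PRIMARY KEY ({})".format(pk_names))
    return "CREATE TABLE `{}` (\n{}\n) DEFAULT CHARSET=utf8;".format(
        table_name, ",\n".join(lines)
    )


if __name__ == "__main__":
    print(build_create_table("cnblogs_article", COLUMNS))
```

The printed statement can be pasted into a Navicat query window. Making url_object_id the primary key is what lets a repeated crawl of the same URL update the existing row instead of inserting a duplicate.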
III. Code Implementation
In main.py, add the project directory to the module search path and run the Scrapy crawl command for the spider (named cnblogs):
import sys
import os
from scrapy.cmdline import execute  # runs a scrapy command from Python

if __name__ == '__main__':
    # make the project directory importable, then run "scrapy crawl cnblogs"
    sys.path.append(os.path.dirname(os.path.abspath(__file__)))
    execute(["scrapy", "crawl", "cnblogs"])
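Write the core code in cnblogs.py. The spider name passed to scrapy crawl in main must match the spider's name attribute; here the file is named after it as well: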
import re
import json
import scrapy
from urllib import parse
from scrapy import Request
from CnblogsSpider.utils import common
from CnblogsSpider.items import CnblogsArticleItem, ArticleItemLoader


class CnblogsSpider(scrapy.Spider):
    name = 'cnblogs'
    allowed_domains = ['news.cnblogs.com']  # domains the spider is allowed to crawl
    start_urls = ['http://news.cnblogs.com/']  # once main starts the spider, the HTML of start_urls is downloaded first
    custom_settings = {  # override the project settings for this spider only, so other spiders are not affected
        "COOKIES_ENABLED": True
    }

    def start_requests(self):  # entry point: log in through a real browser to obtain cookies
        import undetected_chromedriver.v2 as uc  # recent releases: import undetected_chromedriver as uc
        browser = uc.Chrome()  # launches Chrome automatically
        browser.get("https://account.cnblogs.com/signin")
        input("Log in in the browser, then press Enter to continue: ")
        cookies = browser.get_cookies()  # fetch the cookies and convert them to a dict
        cookie_dict = {}
        for cookie in cookies:
            cookie_dict[cookie['name']] = cookie['value']
        for url in self.start_urls:
            headers = {
                'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.81 Safari/537.36'
            }  # a browser User-Agent makes it harder for the site to identify the crawler
            yield scrapy.Request(url, cookies=cookie_dict, headers=headers, dont_filter=True)  # hand the cookies to Scrapy
    # end of the simulated-login code

    def parse(self, response):
        url = response.xpath('//div[@id="news_list"]//h2[@class="news_entry"]/a/@href').extract_first("")  # URL of the first entry (not used below)
        post_nodes = response.xpath('//div[@class="news_block"]')  # SelectorList
        for post_node in post_nodes:  # Selector
            image_url = post_node.xpath('.//div[@class="entry_summary"]/a/img/@src').extract_first("")  # select the element with XPath and extract the URL as a string
            if image_url.startswith("//"):
                image_url = "https:" + image_url
            post_url = post_node.xpath('.//h2[@class="news_entry"]/a/@href').extract_first("")  # the leading dot makes the XPath relative to post_node
            yield Request(url=parse.urljoin(response.url, post_url), meta={"front_image_url": image_url},
                          callback=self.parse_detail)
        # extract the next-page URL and hand it to Scrapy for download
        next_url = response.xpath('//a[contains(text(),"Next >")]/@href').extract_first("")
        if next_url:  # on the last page there is no Next link, so stop paginating
            yield Request(url=parse.urljoin(response.url, next_url), callback=self.parse)

    def parse_detail(self, response):
        match_re = re.match(r".*?(\d+)", response.url)
        if match_re:  # everything below needs post_id, so it stays inside this branch
            post_id = match_re.group(1)
            # Earlier version that filled the item by hand, kept for reference:
            # title = response.xpath('//div[@id="news_title"]/a/text()').extract_first("")
            # create_date = response.xpath('//*[@id="news_info"]//*[@class="time"]/text()').extract_first("")
            # match_re = re.match(r".*?(\d+.*)", create_date)
            # if match_re:
            #     create_date = match_re.group(1)
            # content = response.xpath('//div[@id="news_content"]').extract()[0]
            # tag_list = response.xpath('//div[@class="news_tags"]/a/text()').extract()
            # tags = ",".join(tag_list)
            # article_item = CnblogsArticleItem()
            # article_item["title"] = title
            # article_item["create_date"] = create_date
            # article_item["content"] = content
            # article_item["tags"] = tags
            # article_item["url"] = response.url
            # if response.meta.get("front_image_url", ""):
            #     article_item["front_image_url"] = [response.meta.get("front_image_url", "")]
            # else:
            #     article_item["front_image_url"] = []
            item_loader = ArticleItemLoader(item=CnblogsArticleItem(), response=response)
            # item_loader.add_xpath('title','//div[@id="news_title"]/a/text()')
            # item_loader.add_xpath('create_date', '//*[@id="news_info"]//*[@class="time"]/text()')
            # item_loader.add_xpath('content', '//div[@id="news_content"]')
            # item_loader.add_xpath('tags', '//div[@class="news_tags"]/a/text()')
            item_loader.add_xpath("title", "//div[@id='news_title']/a/text()")
            item_loader.add_xpath("create_date", "//*[@id='news_info']//*[@class='time']/text()")
            item_loader.add_xpath("content", "//div[@id='news_content']")
            item_loader.add_xpath("tags", "//div[@class='news_tags']/a/text()")
            item_loader.add_value("url", response.url)
            if response.meta.get("front_image_url", ""):
                item_loader.add_value("front_image_url", response.meta.get("front_image_url", ""))
            # article_item = item_loader.load_item()
            # request the Ajax endpoint for the counters and pass the loader along in meta
            yield Request(url=parse.urljoin(response.url, "/NewsAjax/GetAjaxNewsInfo?contentId={}".format(post_id)),
                          meta={"article_item": item_loader, "url": response.url}, callback=self.parse_nums)

    def parse_nums(self, response):
        j_data = json.loads(response.text)
        item_loader = response.meta.get("article_item", "")
        # praise_nums = j_data["DiggCount"]
        # fav_nums = j_data["TotalView"]
        # comment_nums = j_data["CommentCount"]
        item_loader.add_value("praise_nums", j_data["DiggCount"])
        item_loader.add_value("fav_nums", j_data["TotalView"])
        item_loader.add_value("comment_nums", j_data["CommentCount"])
        item_loader.add_value("url_object_id", common.get_md5(response.meta.get("url", "")))
        # article_item["praise_nums"] = praise_nums
        # article_item["fav_nums"] = fav_nums
        # article_item["comment_nums"] = comment_nums
        # article_item["url_object_id"] = common.get_md5(article_item["url"])
        article_item = item_loader.load_item()
        yield article_item
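The spider resolves every link it extracts against the URL of the page it came from. The sketch below isolates that URL handling using only the standard library; the example URLs are invented, and the helper names absolute_image_url and absolute_link are hypothetical.

```python
from urllib.parse import urljoin

PAGE_URL = "https://news.cnblogs.com/n/page/2/"  # example list-page URL (assumed)


def absolute_image_url(src):
    """Prefix protocol-relative image URLs (//host/path) with https:."""
    return "https:" + src if src.startswith("//") else src


def absolute_link(page_url, href):
    """Resolve an href taken from a page against that page's own URL."""
    return urljoin(page_url, href)


if __name__ == "__main__":
    print(absolute_image_url("//images.example.com/a.png"))
    print(absolute_link(PAGE_URL, "/n/123456/"))
    # An empty href resolves to the page itself.
    print(absolute_link(PAGE_URL, ""))
```

Because an empty href resolves back to the current page, a missing "Next" link is best treated as the end of pagination rather than passed on as a new request.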
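The dynamically loaded data mentioned earlier is handled as follows:
Press F12 to open the developer tools, refresh the page, and open the Network tab. Look for a request whose name contains Ajax, click it to see its URL, and open that URL: it returns the data in JSON format. The key code is: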
    def parse_detail(self, response):
        match_re = re.match(r".*?(\d+)", response.url)
        if match_re:
            post_id = match_re.group(1)
            item_loader = ArticleItemLoader(item=CnblogsArticleItem(), response=response)
            # (static-page fields omitted here; see the full listing above)
            item_loader.add_value("url", response.url)
            yield Request(url=parse.urljoin(response.url, "/NewsAjax/GetAjaxNewsInfo?contentId={}".format(post_id)),
                          meta={"article_item": item_loader, "url": response.url}, callback=self.parse_nums)

    def parse_nums(self, response):
        j_data = json.loads(response.text)
        item_loader = response.meta.get("article_item", "")
        item_loader.add_value("praise_nums", j_data["DiggCount"])
        item_loader.add_value("fav_nums", j_data["TotalView"])
        item_loader.add_value("comment_nums", j_data["CommentCount"])
        item_loader.add_value("url_object_id", common.get_md5(response.meta.get("url", "")))
        article_item = item_loader.load_item()
        yield article_item
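The two steps in this excerpt, building the Ajax URL from the article ID and reading the counters out of the JSON body, can be tried without Scrapy. This is a rough sketch under stated assumptions: the helper names ajax_info_url and parse_counts are hypothetical, the detail-page URL is an example, and the sample payload only reuses the three key names from the spider with invented values.

```python
import json
import re
from urllib.parse import urljoin


def ajax_info_url(detail_url):
    """Build the GetAjaxNewsInfo URL for a detail page, or None if it has no numeric id."""
    match = re.match(r".*?(\d+)", detail_url)
    if not match:
        return None
    path = "/NewsAjax/GetAjaxNewsInfo?contentId={}".format(match.group(1))
    return urljoin(detail_url, path)


def parse_counts(body):
    """Extract the three counters from the JSON body of the Ajax response."""
    data = json.loads(body)
    return {
        "praise_nums": data["DiggCount"],
        "fav_nums": data["TotalView"],
        "comment_nums": data["CommentCount"],
    }


# Hypothetical payload: key names come from the spider, values are invented.
SAMPLE_BODY = '{"DiggCount": 3, "TotalView": 120, "CommentCount": 5}'

if __name__ == "__main__":
    print(ajax_info_url("https://news.cnblogs.com/n/123456/"))
    print(parse_counts(SAMPLE_BODY))
```

Returning None when the URL has no numeric ID mirrors the `if match_re:` guard in the spider: without an ID there is no Ajax request to make.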
items.py defines the item and how its fields are processed:
import re
import scrapy
from scrapy.loader import ItemLoader
from scrapy.loader.processors import Join, MapCompose, TakeFirst, Identity  # on newer Scrapy versions: from itemloaders.processors import ...


class CnblogsspiderItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    pass


def date_convert(value):
    # keep everything from the first digit onward; fall back to a default date
    match_re = re.match(r".*?(\d+.*)", value)
    if match_re:
        return match_re.group(1)
    else:
        return "1970-07-01"


# def remove_tags(value):
#     # drop "评论" (comment) entries picked up among the tags: return "" for them,
#     # then pass this function through MapCompose()
#     if "评论" in value:
#         return ""
#     else:
#         return value


class ArticleItemLoader(ItemLoader):
    default_output_processor = TakeFirst()  # output only the first non-empty value of the collected list


class CnblogsArticleItem(scrapy.Item):
    title = scrapy.Field()
    create_date = scrapy.Field(
        input_processor=MapCompose(date_convert)  # clean the date string with a regex
    )
    url = scrapy.Field()
    url_object_id = scrapy.Field()
    front_image_url = scrapy.Field(
        output_processor=Identity()  # keep the value as a list
    )
    front_image_path = scrapy.Field()
    praise_nums = scrapy.Field()
    comment_nums = scrapy.Field()
    fav_nums = scrapy.Field()
    tags = scrapy.Field(
        output_processor=Join(separator=",")  # join the list of tags into one string
    )
    content = scrapy.Field()
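To see what the processors do to the collected values, the sketch below reproduces their behavior in plain Python. take_first and join_values are hypothetical stand-ins written for illustration, not the Scrapy classes themselves, and the sample inputs are invented.

```python
import re


def date_convert(value):
    """Keep everything from the first digit onward; fall back to a default date."""
    match = re.match(r".*?(\d+.*)", value)
    return match.group(1) if match else "1970-07-01"


def take_first(values):
    """Stand-in for TakeFirst: return the first value that is not None or ''."""
    for value in values:
        if value is not None and value != "":
            return value
    return None


def join_values(values, separator=","):
    """Stand-in for Join: concatenate the collected values with a separator."""
    return separator.join(values)


if __name__ == "__main__":
    # A loader collects a list per field; the input processor runs on each element.
    collected_dates = [date_convert(v) for v in ["发布于 2022-03-01 10:20"]]
    print(take_first(collected_dates))
    print(join_values(["Python", "Scrapy"]))
```

The default TakeFirst output fits single-valued fields such as title. front_image_url overrides it with Identity because the image pipeline iterates over a list of URLs.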
pipelines.py defines how the items are stored:
import scrapy
import requests
import MySQLdb
from MySQLdb.cursors import DictCursor
from twisted.enterprise import adbapi
from scrapy.exporters import JsonItemExporter
from scrapy.pipelines.images import ImagesPipeline


class CnblogsSpiderPipeline(object):
    def process_item(self, item, spider):
        return item


class ArticleImagePipeline(ImagesPipeline):
    def get_media_requests(self, item, info):
        # items without an image simply yield no download request
        for image_url in item.get('front_image_url', []):
            yield scrapy.Request(image_url)

    def item_completed(self, results, item, info):  # hook called once the image downloads for an item have finished
        if "front_image_url" in item:
            image_file_path = ""
            for ok, value in results:
                if ok:  # value is a dict with "path" only for successful downloads
                    image_file_path = value["path"]
            item["front_image_path"] = image_file_path
        return item

    # def get_media_requests(self, item, info):
    #     for image_url in item['front_image_url']:
    #         yield self.Request(image_url)

# class ArticleImagePipeline(ImagesPipeline):
#     def item_completed(self, results, item, info):
#         if "front_image_url" in item:
#             for ok, value in results:
#                 image_file_path = value["path"]
#             item["front_image_path"] = image_file_path
#
#         return item


class JsonExporterPipeline(object):
    # step 1: open the file
    def __init__(self):
        self.file = open("articleexport.json", "wb")  # "wb" overwrites; use "ab" to append
        self.exporter = JsonItemExporter(self.file, encoding="utf-8", ensure_ascii=False)
        self.exporter.start_exporting()

    def process_item(self, item, spider):
        self.exporter.export_item(item)
        return item

    def close_spider(self, spider):  # Scrapy calls close_spider on pipelines when the spider finishes
        self.exporter.finish_exporting()
        self.file.close()


class MysqlTwistedPipline(object):
    def __init__(self, dbpool):
        self.dbpool = dbpool

    @classmethod
    def from_settings(cls, settings):
        dbparms = dict(
            host=settings["MYSQL_HOST"],
            db=settings["MYSQL_DBNAME"],
            user=settings["MYSQL_USER"],
            passwd=settings["MYSQL_PASSWORD"],
            charset='utf8',
            cursorclass=DictCursor,
            use_unicode=True,
        )
        dbpool = adbapi.ConnectionPool("MySQLdb", **dbparms)
        return cls(dbpool)

    def process_item(self, item, spider):
        # use Twisted to run the MySQL insert asynchronously
        query = self.dbpool.runInteraction(self.do_insert, item)
        query.addErrback(self.handle_error, item, spider)  # handle exceptions
        return item

    def handle_error(self, failure, item, spider):
        # handle exceptions raised by the asynchronous insert
        print(failure)

    def do_insert(self, cursor, item):
        # perform the actual insert
        # a different SQL statement could be built per item type and inserted into MySQL
        # insert_sql, params = item.get_insert_sql()
        insert_sql = """
            insert into cnblogs_article(title, url, url_object_id, front_image_url, front_image_path, praise_nums, comment_nums, fav_nums, tags, content, create_date)
            values (%s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s) ON DUPLICATE KEY UPDATE praise_nums=VALUES(praise_nums)
        """  # on a primary-key conflict, update praise_nums with the new value
        # build the parameters one by one, which makes problems easier to trace
        params = list()
        # params.append(item["title"])  # would raise if the field is missing; item.get allows empty values
        params.append(item.get("title", ""))
        params.append(item.get("url", ""))
        params.append(item.get("url_object_id", ""))
        # params.append(item.get("front_image_url", ""))  # the value is a list (possibly empty), so join it into a string
        front_image = ",".join(item.get("front_image_url", []))
        params.append(front_image)
        params.append(item.get("front_image_path", ""))
        params.append(item.get("praise_nums", 0))
        params.append(item.get("comment_nums", 0))
        params.append(item.get("fav_nums", 0))
        params.append(item.get("tags", ""))
        params.append(item.get("content", ""))
        params.append(item.get("create_date", "1970-07-01"))
        cursor.execute(insert_sql, tuple(params))  # convert the list to a tuple
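The results argument that ArticleImagePipeline.item_completed receives is a list of (success, value) pairs, where value is a dict with the file information only when the download succeeded. The sketch below shows one way to read it safely. The helper first_downloaded_path is hypothetical, the sample data is invented, and a plain Exception stands in for the failure object Scrapy would pass.

```python
def first_downloaded_path(results):
    """Return the path of the first successful download, or '' if none succeeded."""
    for ok, value in results:
        if ok:  # on failure, value is an error object, not a dict
            return value["path"]
    return ""


# Invented sample in the (success, value) shape that item_completed receives.
SAMPLE_RESULTS = [
    (False, Exception("download failed")),
    (True, {"url": "https://example.com/a.png", "path": "full/a.png"}),
]

if __name__ == "__main__":
    print(first_downloaded_path(SAMPLE_RESULTS))
```

Checking the success flag before indexing value["path"] keeps a single failed image from raising inside the pipeline and dropping the whole article.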
settings.py holds the global configuration:
import os
# Scrapy settings for CnblogsSpider project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
# https://docs.scrapy.org/en/latest/topics/settings.html
# https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
# https://docs.scrapy.org/en/latest/topics/spider-middleware.html
BOT_NAME = 'CnblogsSpider'
SPIDER_MODULES = ['CnblogsSpider.spiders']
NEWSPIDER_MODULE = 'CnblogsSpider.spiders'
# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'CnblogsSpider (+http://www.yourdomain.com)'
# Obey robots.txt rules
ROBOTSTXT_OBEY = False
# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32
# Configure a delay for requests for the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16
# Disable cookies (enabled by default)
#COOKIES_ENABLED = False
# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False
# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
# 'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
# 'Accept-Language': 'en',
#}
# Enable or disable spider middlewares
# See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
# 'CnblogsSpider.middlewares.CnblogsspiderSpiderMiddleware': 543,
#}
# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
# 'CnblogsSpider.middlewares.CnblogsspiderDownloaderMiddleware': 543,
#}
# Enable or disable extensions
# See https://docs.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
# 'scrapy.extensions.telnet.TelnetConsole': None,
#}
# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
#ITEM_PIPELINES = {
# 'CnblogsSpider.pipelines.CnblogsspiderPipeline': 300,
#}
# Enable and configure the AutoThrottle extension (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False
# Enable and configure HTTP caching (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
ITEM_PIPELINES = {
    'CnblogsSpider.pipelines.ArticleImagePipeline': 1,
    'CnblogsSpider.pipelines.MysqlTwistedPipline': 2,
    'CnblogsSpider.pipelines.JsonExporterPipeline': 3,
    'CnblogsSpider.pipelines.CnblogsSpiderPipeline': 300
}
IMAGES_URLS_FIELD = "front_image_url"
project_dir = os.path.dirname(os.path.abspath(__file__))
IMAGES_STORE = os.path.join(project_dir, 'images')
MYSQL_HOST = "127.0.0.1"
MYSQL_DBNAME = "article_spider"
MYSQL_USER = "root"
MYSQL_PASSWORD = "root"
SQL_DATETIME_FORMAT = "%Y-%m-%d %H:%M:%S"
SQL_DATE_FORMAT = "%Y-%m-%d"
common.py converts a URL into its MD5 hash:
import hashlib


def get_md5(url):
    if isinstance(url, str):
        url = url.encode("utf-8")
    m = hashlib.md5()
    m.update(url)
    return m.hexdigest()
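A quick usage check of get_md5, written as a self-contained sketch (the example URL is an assumption). It shows that the digest is always 32 hexadecimal characters, which fits the varchar(50) url_object_id column, and that the same URL always maps to the same ID.

```python
import hashlib


def get_md5(url):
    """Return the hex MD5 digest of a URL given as str or bytes."""
    if isinstance(url, str):
        url = url.encode("utf-8")
    return hashlib.md5(url).hexdigest()


if __name__ == "__main__":
    digest = get_md5("https://news.cnblogs.com/n/123456/")  # example URL (assumed)
    print(digest, len(digest))
```

A fixed-length, deterministic ID is what makes the URL usable as a primary key even when the URL itself is long.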
Finally, open Navicat for MySQL, connect to the database, and run main.py. The crawler starts and writes the scraped data to the database.