About pyspider: my first crawler program with the Python crawler framework pyspider is done

For installing pyspider, see the earlier article "Pitfall notes: finally, with some trepidation, I got the Python crawler extension library pyspider installed".

1. Start the pyspider service

pyspider all
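Once the command is running, it can be handy to confirm that the web UI is reachable before opening the browser. The snippet below is a minimal sketch using only the standard library. It assumes the web UI listens on http://localhost:5000/, which is the pyspider default as far as I know; the helper name `webui_is_up` is made up for this example and is not part of pyspider.

```python
import urllib.error
import urllib.request


def webui_is_up(url="http://localhost:5000/", timeout=3.0):
    """Return True if something answers HTTP requests at `url`.

    The default URL is an assumption (pyspider's usual web UI address).
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout):
            return True
    except urllib.error.HTTPError:
        # The server answered, even if with an error status code.
        return True
    except OSError:
        # Connection refused, timeout, DNS failure and similar errors.
        return False
```

Calling `webui_is_up()` right after `pyspider all` has finished starting should return `True`; if it returns `False`, check the console output of the pyspider process for startup errors.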

2. Create a pyspider project

3. Overview of the project areas

4. Start crawling from the Baidu homepage

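Enter the Baidu homepage URL and click run to start crawling, then click one of the crawled links to move on to the next step.

Click any of the crawled links to crawl that page in the next step.

The content of the detail page you entered is returned.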
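The links listed after clicking run are the anchors whose href starts with "http", which is what the selector `a[href^="http"]` in the handler code picks out. To see that filtering in isolation, here is a rough offline sketch. It uses the standard library's `html.parser` as a stand-in for pyquery, and the sample HTML, the class name `AbsoluteLinkParser` and the function `extract_links` are all invented for illustration.

```python
from html.parser import HTMLParser


class AbsoluteLinkParser(HTMLParser):
    """Collect href values of <a> tags that start with "http"."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        # Same idea as the CSS selector a[href^="http"]
        if href and href.startswith("http"):
            self.links.append(href)


def extract_links(html):
    """Return the absolute links found in an HTML string."""
    parser = AbsoluteLinkParser()
    parser.feed(html)
    return parser.links


# Made-up page used only for this demonstration
SAMPLE_HTML = """
<html><head><title>Sample page</title></head><body>
<a href="http://example.com/news">News</a>
<a href="/relative/path">Relative link</a>
<a href="https://example.com/map">Map</a>
<a name="no-href">Anchor without href</a>
</body></html>
"""

print(extract_links(SAMPLE_HTML))
# ['http://example.com/news', 'https://example.com/map']
```

The relative link and the anchor without an href are skipped, so only pages with a full URL would be queued for the detail step.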

5. Functions in the code editing area

#!/usr/bin/env python
# -*- encoding: utf-8 -*-
# Created on 2021-04-10 11:24:26
# Project: test

from pyspider.libs.base_handler import *


# Handler class
class Handler(BaseHandler):
    # Crawler parameter configuration, applied globally (dict)
    crawl_config = {
        'url': 'http://www.baidu.com'
    }

    # Program entry point
    # Runs once a day; minutes is given in minutes
    @every(minutes=24 * 60)
    def on_start(self):
        # Set the URL to crawl
        self.crawl('http://www.baidu.com', callback=self.index_page)

    # Callback that parses the data
    # The page is not crawled again within 10 days; age is given in seconds
    @config(age=10 * 24 * 60 * 60)
    def index_page(self, response):
        # response.doc is a pyquery object, so it is parsed with pyquery
        for each in response.doc('a[href^="http"]').items():
            # Loop over the links and crawl each detail page via the callback
            self.crawl(each.attr.href, callback=self.detail_page)

    # Callback that returns the result
    # Task priority setting
    @config(priority=2)
    def detail_page(self, response):
        # Return the detail page data
        return {
            "url": response.url,
            "title": response.doc('title').text(),
        }
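The two time values in the decorators use different units, which is easy to mix up: `every` takes minutes and `age` takes seconds. The short check below converts both with `datetime.timedelta` to make the durations readable. The constant names are my own and exist only in this example.

```python
from datetime import timedelta

# Values copied from the decorators in the handler above
EVERY_MINUTES = 24 * 60          # @every(minutes=...)
AGE_SECONDS = 10 * 24 * 60 * 60  # @config(age=...)

run_interval = timedelta(minutes=EVERY_MINUTES)
validity = timedelta(seconds=AGE_SECONDS)

print(run_interval)  # 1 day, 0:00:00
print(validity)      # 10 days, 0:00:00
```

So on_start is scheduled once a day, while a page fetched by index_page is treated as still valid for ten days.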
