Web scraping: Python crawler, Day 5

The Requests module (part 2)

Handling websites with untrusted certificates

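1. Requirement: send a request to a website whose SSL certificate is not trusted, and scrape its data
2. Target URL: https://inv-veri.chinatax.gov…
3. What is SSL?
(1) Definition: an SSL certificate is a kind of digital certificate, installed on the server
https = http + ssl
(2) Characteristics: an SSL certificate follows the SSL protocol and is issued by a trusted certificate authority (CA) after the CA has verified the applicant's identity. A certificate that a company produces itself still makes the site show https, but it is not trusted
(3) Functions: an SSL certificate provides both server identity authentication and encryption of the data in transit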
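
To make the two checks behind "trusted" concrete, here is a minimal sketch using Python's standard-library ssl module rather than Requests. The helper name make_context is hypothetical and only for illustration. It shows what a verifying TLS context enforces (a certificate chain that leads to a trusted CA, and a hostname that matches) and what is switched off when verification is disabled, which is roughly what passing verify=False asks Requests to do.

```python
import ssl


def make_context(verify: bool = True) -> ssl.SSLContext:
    """Build a client TLS context; verify=False turns certificate checks off."""
    ctx = ssl.create_default_context()
    if not verify:
        # check_hostname must be disabled before verify_mode can be CERT_NONE
        ctx.check_hostname = False
        ctx.verify_mode = ssl.CERT_NONE
    return ctx


strict = make_context(verify=True)
lenient = make_context(verify=False)

# The strict context requires a trusted certificate and a matching hostname
print(strict.verify_mode == ssl.CERT_REQUIRED, strict.check_hostname)
# The lenient context accepts any certificate, including self-made ones
print(lenient.verify_mode == ssl.CERT_NONE, lenient.check_hostname)
```

With verification off the connection is still encrypted, but the server's identity is no longer checked, so this is only appropriate for sites you have decided to trust anyway.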

cookie

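1. Definition
A cookie identifies a user through information recorded on the client.
HTTP is a stateless protocol: the interaction between client and server is limited to one request and its response, and once that exchange is over, the server treats the next request as coming from a new client. To keep continuity between them, so that the server knows a request comes from the same user as before, the client's information has to be stored somewhere
2. Uses
(1) Getting around anti-scraping measures
(2) Simulating a logged-in session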
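
A cookie string copied from the browser is just a list of name=value pairs separated by semicolons. As a small sketch (the function name cookie_header_to_dict and the sample values are made up for illustration), it can be split into a dictionary like this:

```python
def cookie_header_to_dict(header: str) -> dict:
    """Split a raw Cookie header string into a {name: value} dict."""
    cookies = {}
    for pair in header.split(";"):
        # Split on the first '=' only, since values may contain '=' themselves
        name, sep, value = pair.strip().partition("=")
        if sep and name:
            cookies[name] = value
    return cookies


# Made-up sample data, not a real session
sample = "sessionid=abc123; theme=dark; empty=; info=ssid=s42"
cookies = cookie_header_to_dict(sample)
print(cookies)
```

A dictionary in this shape can be passed to requests.get through its cookies parameter, as an alternative to placing the whole string in the headers.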

Supplement: requests and responses

1. Server-side rendering: the data can be seen in the page source
2. Client-side rendering: the data cannot be seen in the page source
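
One quick way to tell the two cases apart is to search the downloaded source for a piece of text you can see in the browser. The sketch below uses two made-up HTML snippets and a hypothetical helper named appears_in_source; in practice the html argument would be the decoded response body.

```python
def appears_in_source(html: str, expected_text: str) -> bool:
    """Return True if text visible in the browser is present in the raw source."""
    return expected_text in html


# Made-up samples: the first page ships its data in the HTML,
# the second ships an empty container that a script fills in later
server_rendered = "<ul><li>G1234 Beijing</li></ul>"
client_rendered = "<div id='app'></div><script src='/app.js'></script>"

print(appears_in_source(server_rendered, "G1234"))
print(appears_in_source(client_rendered, "G1234"))
```

When the text is missing from the source, the data usually arrives through a separate request, and that request is the URL to scrape instead.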

Code

(1) Code: website_ssl

import requests

# Target URL
url = 'https://inv-veri.chinatax.gov.cn/'

header = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36',
}
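# verify=False skips certificate verification, so the untrusted certificate no longer raises an SSLError
res = requests.get(url, headers=header, verify=False)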
print(res.content.decode('utf-8'))

"""html = res.content.decode('utf-8')
filename = 'gov' + '.html'
with open(filename, 'w', encoding='utf-8') as g:
    g.write(html)
# Just experimenting: save the page to a file
"""

(2) Code: qzone, simulated login

import requests

# Target URL
url = 'https://user.qzone.qq.com/xxxxxxxxxx'
header = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.107 Safari/537.36',
    'cookie': 'RK=5lj83q5PSd; ptcz=1fa7f93cc2c18f43147a7189627ae6080740f72c3df2d08eb012e46db14a477b; pgv_pvid=2420772099; fqm_pvqid=73e03e61-6f91-4107-bc29-e1ac0449e88f; tmeLoginType=2; psrf_qqrefresh_token=53D67C7167BD9A1CAB39187D4C792C97; psrf_qqunionid=; wxunionid=; psrf_qqaccess_token=9CBCE82B15E8671851EFD4B1290DB131; wxopenid=; psrf_access_token_expiresAt=1630838824; psrf_qqopenid=24FAC6F91E334941373749413AE8BBB0; wxrefresh_token=; euin=oK45NK6FNeCq7n**; pac_uid=1_519188694; iip=0; pgv_info=ssid=s4897751060; o_cookie=1519188694; eas_sid=O1o6S2P8b5s65402b7L2L9p980; pvpqqcomrouteLine=wallpaper_wallpaper_wallpaper; _qpsvr_localtk=0.12057115483060432; welcomeflash=1519188694_96544; qz_screen=1536x864; 1519188694_todaycount=0; 1519188694_totalcount=34134; QZ_FE_WEBP_SUPPORT=1; cpu_performance_v8=6; __Q_w_s__QZN_TodoMsgCnt=1; zzpaneluin=; zzpanelkey=; _qz_referrer=i.qq.com; uin=o1519188694; skey=@SJO1fKDSM; p_uin=o1519188694; pt4_token=Aqy-OiLi1f7tedwzlg1wUy*laYC9M8AwdlPLQ-BIcHw_; p_skey=LeobMj5DrbZe75MiQKQXphzo6O-d3OqX25A8MIXGvNo_; qzone_check=1519188694_1628837790',
}

res = requests.get(url, headers=header)
html = res.content.decode('utf-8')

with open('qzone1.html', 'w', encoding='utf-8') as f:
    f.write(html)
print(html)

(3) Code: 12306, getting around anti-scraping

import requests
import json

# Target URL
url = 'https://kyfw.12306.cn/otn/leftTicket/query?leftTicketDTO.train_date=2021-08-18&leftTicketDTO.from_station=BJP&leftTicketDTO.to_station=CQW&purpose_codes=ADULT'

header = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.107 Safari/537.36',
    'Cookie': '_uab_collina=162859214484913672330819; JSESSIONID=3C9DB3104DC02AC8A7C9F0D62730C089; _jc_save_wfdc_flag=dc; RAIL_EXPIRATION=1629178843189; RAIL_DEVICEID=YVHwYLGf-RmKXq__VM-j4-kE5zSKDK92l_0zwONP7fBhWgqOrFnEF7hIUWCvkHM9NvfBzP_ske3ujsIS24pju38UI2R12sKWoTA7fQC7tvGPBrsKdW0hgcgVi_0aLunJ9RTtDmY02BcH4ZCmj8l7h44hhmfDHsd8; BIGipServerotn=3671523594.50210.0000; BIGipServerpassport=971505930.50215.0000; route=6f50b51faa11b987e576cdb301e545c4; _jc_save_fromStation=%u5317%u4EAC%2CBJP; _jc_save_toDate=2021-08-16; _jc_save_toStation=%u91CD%u5E86%2CCQW; _jc_save_fromDate=2021-08-18'
}

# Fetch the response body
res = requests.get(url, headers=header)
# html_str = res.content.decode('utf-8')
# html_dict = res.json()
# print(type(html_str), type(html_dict))
# print(html_dict)
# Inspect the printed data above first; only then can it be parsed

html_str = res.content.decode('utf-8')
html_dict = json.loads(html_str)
# print(html_dict)
# Inspect the printed data above first; apart from the key values, some of the other symbols change between requests

# Parse the data
results = html_dict['data']['result']
# print(results)
for result in results:
    # print(result)
    # print('*' * 100)
    data_lst = result.split('|')
    # flag = 0
    # for d in data_lst:
    #     print(flag, d)
    #     flag += 1
    # print('*'* 50)
    # Our guess: the premier-class seat (特等座) info is at index 32, data_lst[32],
    # and the train number is at index 3, data_lst[3]
    t_name = data_lst[3]
    t_number = data_lst[32]
    # print(t_name, t_number)

    # Check availability: an empty field or '无' (none) means no tickets
    if t_number != '' and t_number != '无':
        print(t_name, 'tickets available')
    else:
        print(t_name, 'sold out')
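
The parsing step can be tried without touching the network. The sketch below is a minimal offline version: the helper names parse_record and make_sample and the sample records are made up, and the indices 3 and 32 are the same guesses as in the listing above, not a documented format.

```python
def parse_record(record: str, name_idx: int = 3, seat_idx: int = 32):
    """Split one pipe-delimited record and report whether seats are left."""
    fields = record.split("|")
    train = fields[name_idx]
    seat = fields[seat_idx]
    # An empty field or '无' (none) is treated as no tickets
    has_ticket = seat not in ("", "无")
    return train, has_ticket


def make_sample(train: str, seat: str, width: int = 40) -> str:
    """Build a fake record with only the two fields of interest filled in."""
    fields = [""] * width
    fields[3] = train
    fields[32] = seat
    return "|".join(fields)


# Made-up records, not real query results
samples = [make_sample("G1", "有"), make_sample("G2", "无"),
           make_sample("G3", ""), make_sample("G4", "5")]
for sample in samples:
    train, has_ticket = parse_record(sample)
    print(train, "tickets available" if has_ticket else "sold out")
```

Keeping the indices as parameters makes it easy to adjust them if the field order turns out to differ from the guess.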