About web crawlers: Python crawler, Day 2

Network request modules for crawlers (Part 1)

The urllib module

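1. What is the urllib module?
(1) It is Python's built-in network request module; like the re and time modules, it ships with the standard library.
(2) Third-party alternatives include requests, scrapy, and others.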

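2. Why learn the urllib module?
(1) To compare it with the third-party requests module.
(2) Some crawler projects need to use urllib.
(3) Sometimes combining urllib with requests makes the code more concise.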

urllib quick start

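1. Using urllib.request
(1) urllib.request.urlopen('URL')
(2) urllib.request.urlopen(request object)

   a. Create a request object and set the User-Agent (UA)
   b. Get the response object via urlopen()
   c. Get the content of the response object with read().decode('utf-8')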
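
Step a can be tried without any network access. The following is a minimal sketch that builds a request object and inspects what it would send; the URL and the shortened UA string are placeholder values chosen for illustration.

```python
import urllib.request

# Placeholder values for illustration only.
url = 'https://www.baidu.com/'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'}

# Step a: create the request object; nothing is sent yet.
req = urllib.request.Request(url, headers=headers)

print(req.full_url)                  # the URL the request targets
print(req.get_method())              # 'GET', because no data was supplied
print(req.get_header('User-agent'))  # Request stores header names capitalized
```

Passing req to urllib.request.urlopen() would then carry out steps b and c.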

Note (response object):

  print(res.getcode())  # get the status code
  print(res.geturl())   # get the URL that was requested
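
To see these response methods return real values without depending on an external site, the sketch below starts a throwaway HTTP server on the local machine and requests it. The handler class, the helper function fetch_local_page, and the response text are all hypothetical names made up for this example; the empty ProxyHandler is only there so that a system proxy does not intercept the local request.

```python
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class HelloHandler(BaseHTTPRequestHandler):
    """Hypothetical handler that answers every GET with a small UTF-8 page."""

    def do_GET(self):
        body = "hello, 爬虫".encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the example output quiet


def fetch_local_page():
    # Port 0 asks the operating system for any free port.
    server = HTTPServer(("127.0.0.1", 0), HelloHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    url = "http://127.0.0.1:%d/" % server.server_port
    # An empty ProxyHandler ignores any system proxy so the request stays local.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    try:
        with opener.open(url, timeout=5) as res:
            return res.getcode(), res.geturl(), res.read().decode("utf-8"), url
    finally:
        server.shutdown()
        server.server_close()


code, final_url, text, url = fetch_local_page()
print(code)       # status code of the response
print(final_url)  # URL that was actually fetched
print(text)       # decoded body
```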

2. Using urllib.parse (covered in Day 3)

Code:

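(1) UserAgent code: the main purpose is to avoid being detected as an automated crawler; it is usually the first step in getting around anti-scraping measures.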
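
Why the UA matters becomes clearer when looking at what urllib sends by default. The following is a small sketch, with no network access, that prints the default User-Agent of a fresh opener and then sets a browser-style one on a request; the shortened browser UA string is a placeholder.

```python
import urllib.request

# A freshly built opener carries urllib's default headers.
opener = urllib.request.build_opener()
default_ua = dict(opener.addheaders)['User-agent']
print(default_ua)  # starts with 'Python-urllib/', which identifies a script

# A UA set on the Request takes precedence over that default when sent.
browser_ua = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'  # placeholder value
req = urllib.request.Request('https://www.baidu.com/',
                             headers={'User-Agent': browser_ua})
print(req.get_header('User-agent'))
```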

    import requests

    url = 'https://www.baidu.com/'
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36',
    }
    res = requests.get(url, headers=headers)
    # print(res)
    # Headers that were actually sent with the request
    header = res.request.headers
    print(header)

(2) urllib_request code

    import urllib.request

    header = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.107 Safari/537.36',
    }
    url = 'https://www.baidu.com/'

    # response: the response object
    response = urllib.request.urlopen(url)

    # 1. read() extracts the content of the response object as a byte stream;
    #    decode('utf-8') converts it to str, and type() shows the data type.
    #    read() consumes the body, so call it once and keep the result.
    html = response.read().decode('utf-8')
    print(html, type(html))

    # 2. The data above is not right; fix it as follows.
    # (1) Create a request object and set the UA.
    #     urlopen() can send the most basic request, but to add Headers and
    #     similar information, use the Request class to construct the request.
    req = urllib.request.Request(url, headers=header)
    # (2) Get the response object via urlopen()
    res = urllib.request.urlopen(req)
    # (3) Get the content of the response object with read().decode('utf-8')
    html = res.read().decode('utf-8')
    print(html, type(html))
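
One detail worth checking when printing a response: the body can only be read once. The sketch below shows this offline by opening a data: URL, which urllib handles locally; the URL content 'hello' is an arbitrary value for illustration.

```python
import urllib.request

# A data: URL keeps the demo offline; the response is still a file-like object.
res = urllib.request.urlopen('data:text/plain;charset=utf-8,hello')

first = res.read()   # consumes the whole body
second = res.read()  # nothing is left to read

print(first)   # b'hello'
print(second)  # b''
```

For this reason it is safer to store the result of read().decode('utf-8') in a variable and reuse it than to call read() a second time.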

(3) pictures code

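    import requests
    from urllib import request

    url = 'http://shp.qpic.cn/ishow/2735061617/1623836895_84828260_11275_sProdImgNo_2.jpg/0'

    # Method 1: open() and write(), then close the file explicitly
    req = requests.get(url)
    print(req)
    f = open('code_img1.png', 'wb')
    f.write(req.content)
    f.close()

    # Method 2: a with block closes the file automatically
    req = requests.get(url)
    with open('code-img2.png', 'wb') as f:
        f.write(req.content)

    # Method 3: urlretrieve() downloads and saves in one call
    request.urlretrieve(url, 'code_img3.png')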
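
The third method can also be tried offline. In the sketch below a local file:// URL stands in for the remote image, which is an assumption made only so the example runs without a network; the file names and the fake image bytes are placeholders.

```python
import tempfile
import urllib.request
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    # Stand-in for a remote image: a few bytes written to a local file.
    source = Path(tmp) / 'source.png'
    source.write_bytes(b'\x89PNG fake image bytes')

    target = Path(tmp) / 'code_img3.png'
    # urlretrieve(url, filename) returns the saved path and the response headers.
    saved_path, headers = urllib.request.urlretrieve(source.as_uri(), str(target))

    copied = target.read_bytes()
    print(saved_path, len(copied))
```

With a real http:// image URL the call has the same shape; only the first argument changes.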