Python Crawlers: Scraping Images in Bulk with the Requests Library


Requests is an HTTP client library for Python.

Requests supports HTTP keep-alive and connection pooling, session persistence via cookies, file uploads, automatic decoding of response content, internationalized URLs, and automatic encoding of POST data.

It is a high-level wrapper over Python's built-in modules that makes issuing network requests far more human-friendly; with Requests you can easily perform any operation a browser can. Modern, international, friendly.

Requests maintains persistent connections (keep-alive) automatically.

Source code: https://github.com/kennethreitz/requests
Chinese documentation: http://docs.python-requests.org/zh_CN/latest/index.html

Contents
I. Requests Basics
II. Sending Requests and Receiving Responses (Basic GET)
III. Sending Requests and Receiving Responses (Basic POST)
IV. response Attributes
V. Proxies
VI. Cookies and Sessions
VII. Case Studies

I. Requests Basics

1. Installing the Requests library

pip install requests

2. Using the Requests library

import requests

II. Sending Requests and Receiving Responses (Basic GET)

response = requests.get(url)

1. Passing the params argument

  • Parameters embedded directly in the URL
response = requests.get("http://httpbin.org/get?name=zhangsan&age=22")
print(response.text)

  • Parameters passed via the params keyword of get()
data = {
    "name": "zhangsan",
    "age": 30
}
response = requests.get("http://httpbin.org/get", params=data)
print(response.text)
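You can inspect how `params` is encoded without actually sending anything, by preparing the request locally. A minimal sketch (httpbin.org is just the placeholder URL used above):

```python
import requests

# Build (but do not send) a GET request to see how `params`
# is encoded into the final URL.
req = requests.Request("GET", "http://httpbin.org/get",
                       params={"name": "zhangsan", "age": 22})
prepared = req.prepare()
print(prepared.url)  # http://httpbin.org/get?name=zhangsan&age=22
```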

2. Sending request headers (the headers parameter)

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.100 Safari/537.36"
}
response = requests.get("http://httpbin.org/get", headers=headers)
print(response.text)

III. Sending Requests and Receiving Responses (Basic POST)

response = requests.post(url, data=data, headers=headers)
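To see what a POST with form `data` actually puts on the wire, you can again prepare the request locally instead of sending it. A small sketch (the httpbin.org URL is only a placeholder):

```python
import requests

# Prepare (without sending) a POST request to inspect how the
# form `data` dict is encoded into the request body.
req = requests.Request("POST", "http://httpbin.org/post",
                       data={"name": "zhangsan", "age": 30})
prepared = req.prepare()
print(prepared.body)                     # name=zhangsan&age=30
print(prepared.headers["Content-Type"])  # application/x-www-form-urlencoded
```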

IV. response Attributes

Attribute               Description
response.text           response body as str (decoded text)
response.content        response body as bytes
response.status_code    HTTP status code of the response
response.headers        response headers
response.request        the request that produced this response
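The str/bytes distinction can be illustrated offline by constructing a Response by hand. This sketch sets the private `_content` attribute, as HTTP mocks commonly do; it is for illustration only, not part of the public API:

```python
import requests

# Hand-built Response: `_content` is private, set here only to
# demonstrate how .text and .content relate.
resp = requests.models.Response()
resp.status_code = 200
resp._content = "requests".encode("utf-8")
resp.encoding = "utf-8"

print(type(resp.text))     # <class 'str'>
print(type(resp.content))  # <class 'bytes'>
print(resp.text == resp.content.decode(resp.encoding))  # True
```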

V. Proxies

proxies = {
    # each key maps to a proxy URL; its scheme should match the key
    "http": "http://175.44.148.176:9000",
    "https": "https://183.129.207.86:14002"
}
response = requests.get("https://www.baidu.com/", proxies=proxies)

VI. Cookies and Sessions

  • Benefit of cookies and sessions: many sites will only serve certain data after you log in (or otherwise obtain some permission).
  • Drawback of cookies and sessions: a cookie/session pair usually corresponds to a single user. Requesting too fast or too often makes it easy for the server to identify you as a crawler, which can get the account penalized.

1. When you do not need cookies, avoid using them.
2. To fetch pages that require login you must send cookies; in that case, protect the account by slowing down the collection rate.

1. Cookies

(1) Reading cookie information

response.cookies
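`response.cookies` is a RequestsCookieJar, which can be converted to a plain dict. A small offline sketch using a stand-alone jar of the same type (the cookie name and value are made up):

```python
import requests

# A stand-alone cookie jar (same type as response.cookies),
# converted to a plain dict; the values are made up.
jar = requests.cookies.RequestsCookieJar()
jar.set("sessionid", "abc123", domain="example.com", path="/")
cookie_dict = requests.utils.dict_from_cookiejar(jar)
print(cookie_dict)  # {'sessionid': 'abc123'}
```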

2. Sessions

(1) Creating a session object

session = requests.Session()

Example:

def login_renren():
    login_url = 'http://www.renren.com/SysHome.do'
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.100 Safari/537.36"
    }

    session = requests.Session()

    login_data = {
        "email": "your-username",
        "password": "your-password"
    }

    # Log in first; the session stores the login cookies automatically.
    response = session.post(login_url, data=login_data, headers=headers)

    # Later requests on the same session carry those cookies along.
    response = session.get("http://www.renren.com/971909762/newsfeed/photo")
    print(response.text)

login_renren()
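The key property the login example relies on is that anything set on a Session is merged into every request it sends. That can be checked offline by preparing a request on the session without sending it (the User-Agent string and URL here are made up):

```python
import requests

# Headers (like cookies) set on a Session are merged into every
# request it prepares; the URL is a placeholder, never contacted.
s = requests.Session()
s.headers.update({"User-Agent": "my-crawler/0.1"})
prepared = s.prepare_request(requests.Request("GET", "http://httpbin.org/get"))
print(prepared.headers["User-Agent"])  # my-crawler/0.1
```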

VII. Case Studies

Case 1: Scraping Baidu Tieba pages (GET request)

import requests
import sys

class BaiduTieBa:
    def __init__(self, name, pn):
        self.name = name
        # Base URL; the page offset (pn) is appended per page below.
        self.url = "http://tieba.baidu.com/f?kw={}&ie=utf-8&pn=".format(name)
        self.headers = {
            # "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.100 Safari/537.36"

            # Send an old browser's User-Agent so the server returns plain
            # HTML that does not rely on JavaScript rendering.
            "User-Agent": "Mozilla/4.0 (compatible; MSIE 5.01; Windows NT 5.0)"
        }
        # Tieba lists 50 posts per page, so the offset grows in steps of 50.
        self.url_list = [self.url + str(i * 50) for i in range(pn)]
        print(self.url_list)

    def get_data(self, url):
        """Fetch one page and return its body as bytes."""
        response = requests.get(url, headers=self.headers)
        return response.content

    def save_data(self, data, num):
        """Save one page as ./pages/<name>_<num>.html."""
        file_name = "./pages/" + self.name + "_" + str(num) + ".html"
        with open(file_name, "wb") as f:
            f.write(data)

    def run(self):
        for num, url in enumerate(self.url_list):
            data = self.get_data(url)
            self.save_data(data, num)

if __name__ == "__main__":
    name = sys.argv[1]
    pn = int(sys.argv[2])
    baidu = BaiduTieBa(name, pn)
    baidu.run()

Case 2: Jinshan Ciba (iCIBA) translation (POST request)

import requests
import sys
import json

class JinshanCiBa:
    def __init__(self, words):
        self.url = "http://fy.iciba.com/ajax.php?a=fy"
        self.headers = {
            "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36 SE 2.X MetaSr 1.0",
            # Mark the request as AJAX, as the site's own page does.
            "X-Requested-With": "XMLHttpRequest"
        }
        self.post_data = {
            "f": "auto",  # source language: auto-detect
            "t": "auto",  # target language: auto-detect
            "w": words    # the text to translate
        }

    def get_data(self):
        """POST the form data and return the response body as text."""
        response = requests.post(self.url, data=self.post_data, headers=self.headers)
        return response.text

    def show_translation(self):
        """Parse the JSON response and print the translation."""
        response = self.get_data()
        # json.loads() no longer accepts an `encoding` argument (removed in Python 3.9).
        json_data = json.loads(response)
        if json_data['status'] == 0:    # word lookup
            translation = json_data['content']['word_mean']
        elif json_data['status'] == 1:  # sentence translation
            translation = json_data['content']['out']
        else:
            translation = None
        print(translation)

    def run(self):
        self.show_translation()

if __name__ == "__main__":
    words = sys.argv[1]
    ciba = JinshanCiBa(words)
    ciba.run()
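The status-dispatch logic above can be exercised offline with a hand-made payload. The field names follow the code above; the payload content itself is made up for illustration:

```python
import json

# A hand-made payload in the shape the crawler expects
# (status 0 = word lookup); the values are invented.
sample = '{"status": 0, "content": {"word_mean": "n. example; instance"}}'
data = json.loads(sample)
if data["status"] == 0:
    translation = data["content"]["word_mean"]
elif data["status"] == 1:
    translation = data["content"]["out"]
else:
    translation = None
print(translation)  # n. example; instance
```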

Case 3: Scraping Baidu Tieba images

(1) Basic version

Extract image URLs from pages that have already been downloaded (see Case 1 for downloading the pages).

from lxml import etree
import requests

class DownloadPhoto:
    def __init__(self):
        pass

    def download_img(self, url):
        response = requests.get(url)
        # Use everything after the last '/' as the file name.
        index = url.rfind('/')
        file_name = url[index + 1:]
        print("Downloading image: " + file_name)
        save_name = "./photo/" + file_name
        with open(save_name, "wb") as f:
            f.write(response.content)

    def parse_photo_url(self, page):
        html = etree.parse(page, etree.HTMLParser())
        nodes = html.xpath("//a[contains(@class,'thumbnail')]/img/@bpic")
        print(nodes)
        print(len(nodes))
        for node in nodes:
            self.download_img(node)

if __name__ == "__main__":
    download = DownloadPhoto()
    for i in range(6000):
        download.parse_photo_url("./pages/校花_{}.html".format(i))
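The XPath used above can be checked against a small hand-written fragment that mimics Tieba's thumbnail markup (the markup and URL below are invented; this needs lxml, which the crawler already depends on):

```python
from lxml import etree

# Hand-written fragment mimicking a Tieba thumbnail link.
fragment = """
<div>
  <a class="thumbnail medium_thumbnail">
    <img bpic="http://imgsrc.baidu.com/forum/abc.jpg"/>
  </a>
</div>
"""
html = etree.HTML(fragment)
urls = html.xpath("//a[contains(@class,'thumbnail')]/img/@bpic")
print(urls)  # ['http://imgsrc.baidu.com/forum/abc.jpg']
```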

(2) Multithreaded version

main.py

import requests
from lxml import etree

from file_download import DownLoadExecutioner, file_download

class XiaoHua:
    def __init__(self, init_url):
        self.init_url = init_url
        self.download_executioner = DownLoadExecutioner()

    def start(self):
        self.download_executioner.start()
        self.download_img(self.init_url)

    def download_img(self, url):
        html_text = file_download(url, type='text')
        html = etree.HTML(html_text)
        img_urls = html.xpath("//a[contains(@class,'thumbnail')]/img/@bpic")
        self.download_executioner.put_task(img_urls)

        # Follow the link to the next page; stop when there is none.
        next_page = html.xpath("//div[@id='frs_list_pager']/a[contains(@class,'next')]/@href")
        if not next_page:
            return
        next_page = "http:" + next_page[0]
        self.download_img(next_page)

if __name__ == '__main__':
    x = XiaoHua("http://tieba.baidu.com/f?kw=校花&ie=utf-8")
    x.start()

file_download.py

import requests
import threading
from queue import Queue

def file_download(url, type='content'):
    """Download a URL; return text when type='text', raw bytes otherwise."""
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64; Trident/7.0; rv:11.0) like Gecko'
    }
    r = requests.get(url, headers=headers)
    if type == 'text':
        return r.text
    return r.content

class DownLoadExecutioner(threading.Thread):
    def __init__(self):
        super().__init__()
        self.q = Queue(maxsize=50)
        # directory where images are saved
        self.save_dir = './img/'
        # running count of downloaded images
        self.index = 0

    def put_task(self, urls):
        if isinstance(urls, list):
            for url in urls:
                self.q.put(url)
        else:
            self.q.put(urls)

    def run(self):
        while True:
            url = self.q.get()
            content = file_download(url)

            # Derive the file name from the URL.
            index = url.rfind('/')
            file_name = url[index+1:]
            save_name = self.save_dir + file_name
            with open(save_name, 'wb') as f:
                f.write(content)
                self.index += 1
                print(save_name + " downloaded. Total images so far: " + str(self.index))

(3) Thread-pool version

main.py

import requests
from lxml import etree

from file_download_pool import DownLoadExecutionerPool, file_download

class XiaoHua:
    def __init__(self, init_url):
        self.init_url = init_url
        self.download_executioner = DownLoadExecutionerPool()

    def start(self):
        self.download_img(self.init_url)

    def download_img(self, url):
        html_text = file_download(url, type='text')
        html = etree.HTML(html_text)
        img_urls = html.xpath("//a[contains(@class,'thumbnail')]/img/@bpic")
        self.download_executioner.put_task(img_urls)

        # Follow the link to the next page; stop when there is none.
        next_page = html.xpath("//div[@id='frs_list_pager']/a[contains(@class,'next')]/@href")
        if not next_page:
            return
        next_page = "http:" + next_page[0]
        self.download_img(next_page)

if __name__ == '__main__':
    x = XiaoHua("http://tieba.baidu.com/f?kw=校花&ie=utf-8")
    x.start()

file_download_pool.py

import requests
import concurrent.futures as futures

def file_download(url, type='content'):
    """Download a URL; return text when type='text', raw bytes otherwise."""
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64; Trident/7.0; rv:11.0) like Gecko'
    }
    r = requests.get(url, headers=headers)
    if type == 'text':
        return r.text
    return r.content

class DownLoadExecutionerPool:
    def __init__(self):
        # directory where images are saved
        self.save_dir = './img_pool/'
        # running count of downloaded images
        self.index = 0
        # thread pool with 30 worker threads
        self.ex = futures.ThreadPoolExecutor(max_workers=30)

    def put_task(self, urls):
        if isinstance(urls, list):
            for url in urls:
                self.ex.submit(self.save_img, url)
        else:
            self.ex.submit(self.save_img, urls)

    def save_img(self, url):
        content = file_download(url)

        # Derive the file name from the URL.
        index = url.rfind('/')
        file_name = url[index+1:]
        save_name = self.save_dir + file_name
        with open(save_name, 'wb') as f:
            f.write(content)
            self.index += 1
            print(save_name + " downloaded. Total images so far: " + str(self.index))
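The pool pattern above (submit tasks, let worker threads drain them) can be sketched in isolation with a trivial task function; `square` and the worker count here are made up for illustration:

```python
import concurrent.futures as futures

def square(x):
    return x * x

# map() distributes the calls across the pool's worker threads
# and yields results in input order.
with futures.ThreadPoolExecutor(max_workers=4) as ex:
    results = list(ex.map(square, range(5)))
print(results)  # [0, 1, 4, 9, 16]
```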

Author: Recalcitrant
Link: https://www.jianshu.com/p/140…

