Sharing a Python script that visits web pages through proxy IPs, which comes in handy for scraping data and the like.
When would you use proxy IPs? Say you want to scrape a site with 1,000,000 records, and they rate-limit each IP to 1,000 records per hour. A single IP, being throttled, would need 1,000 hours, roughly 40 days, to pull everything. With proxy IPs, constantly switching addresses, you can break past the 1,000-per-hour cap and finish much faster.
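To make the rotation idea concrete, here is a minimal sketch, assuming you already have a pool of working proxies; the addresses and the fetch() helper are hypothetical, not part of the script below:

import itertools
import requests

# Hypothetical pool of working proxies; in practice you would fill
# this from a proxy-list site like the one scraped below.
proxy_pool = itertools.cycle([
    'http://1.2.3.4:8080',
    'http://5.6.7.8:3128',
])

def fetch(url):
    # Each request goes out through the next proxy in the cycle,
    # so no single IP absorbs all the requests and hits the cap.
    proxy = next(proxy_pool)
    scheme = 'https' if proxy.startswith('https') else 'http'
    return requests.get(url, proxies={scheme: proxy}, timeout=5)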
The script:
import requests
from lxml import etree
def get_proxy_list(gourl):
    """Fetch a proxy-list page and return proxies as ['http://ip:port', ...]."""
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36",
        'Accept': 'application/json, text/javascript, */*; q=0.01',
    }
    response = requests.get(gourl, headers=headers)
    tree = etree.HTML(response.text)
    type_dct = {
        "HTTP": "http://",
        "HTTPS": "https://"
    }
    res = []
    # IP is in column 1, port in column 2, protocol type in column 4.
    for data in tree.xpath("//tbody/tr"):
        ip = data.xpath("./td[1]/text()")[0]
        port = data.xpath("./td[2]/text()")[0]
        proxy_type = data.xpath("./td[4]/text()")[0]
        res.append(type_dct[proxy_type] + ip + ':' + port)
    return res
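Note that the XPath assumes the proxy-list page renders its proxies as an HTML table, with the IP in column 1, the port in column 2, and the protocol type (HTTP/HTTPS) in column 4. If your source site lays its table out differently, adjust the td indexes accordingly.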
Test the proxies:
def check(proxy):
    """Return True if the proxy can fetch a test page within 5 seconds."""
    href = 'http://www.baidu.com/'
    if 'https' in proxy:
        proxies = {'https': proxy}
    else:
        proxies = {'http': proxy}
    headers = {
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 11_2_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4396.0 Safari/537.36'
    }
    try:
        r = requests.get(href, proxies=proxies, timeout=5, headers=headers)
        return r.status_code == 200
    except requests.RequestException:
        # Connection errors and timeouts mean the proxy is unusable.
        return False
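One practical note: checking a long list one proxy at a time is slow, because every dead proxy costs up to the full 5-second timeout. Below is a minimal sketch, assuming the check() function above, that validates the list concurrently with the standard library; check_all is just an illustrative helper name, not part of the original script:

from concurrent.futures import ThreadPoolExecutor

def check_all(proxy_list):
    # Probe all proxies concurrently; each dead proxy only costs
    # its own timeout instead of stalling the whole loop.
    with ThreadPoolExecutor(max_workers=20) as pool:
        results = list(pool.map(check, proxy_list))
    return [p for p, ok in zip(proxy_list, results) if ok]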
if __name__ == '__main__':
    # Replace with the proxy-list page you want to pull from.
    gourl = 'https://www.jxmtjt.com/xy/t6628956009664610563'
    proxy_list = get_proxy_list(gourl)
    print(proxy_list)
    for p in proxy_list:
        print(p, check(p))
Once you've copied the code, just pass a URL to the get_proxy_list function and it's ready to use.
For example: get_proxy_list('https://www.jxmtjt.com/xy/t6628956009664610563')
I've been using this code for a while; you can also search SegmentFault (思否) or Baidu for free proxy-IP list sites.