Batch-downloading PDF files from web pages

jiezi

6 years ago

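Task: batch-download PDF files from web pages
An Excel file contains several thousand web page URLs, each pointing to a page with a PDF download link. The PDF files behind these pages need to be downloaded in bulk. Python environment:
anaconda3, openpyxl, beautifulsoup4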
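Read the Excel file and get the web page URLs
Use the openpyxl library to read the .xlsx file. (I tried reading the .xls file with the xlrd library, but could not get the hyperlinks.)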

Install openpyxl
pip install openpyxl

Extract the hyperlinks from the xlsx file
Structure of the example file

Announcement date    Security code    Announcement title
2018-04-20           603999.SH        读者传媒: 2017 Annual Report
2018-04-28           603998.SH        方盛制药: 2017 Annual Report

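import re
import openpyxl

def readxlsx(path):
    workbook = openpyxl.load_workbook(path)
    Data_sheet = workbook['sheet1']  # get_sheet_by_name() is deprecated
    rowNum = Data_sheet.max_row  # largest row number in use
    c = 3  # the third column holds the hyperlinks to extract
    server = 'http://news.windin.com/ns/'
    for row in range(1, rowNum + 1):
        link = Data_sheet.cell(row=row, column=c).value
        parts = re.split(r'"', str(link or ''))
        if len(parts) < 2:  # no quoted URL in this cell (e.g. a header row)
            continue
        url = parts[1]
        print(url)
        downEachPdf(url, server)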
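The function above splits the cell value on the double quote and takes index 1, which suggests the cell holds an Excel formula of the form =HYPERLINK("url","title"); that form is an assumption, since the post does not show the raw cell value. A slightly more defensive sketch follows. The helper name extract_url and the sample URL are hypothetical and not part of the original script. It returns None for cells that do not match, such as a header row.

```python
import re

# Matches =HYPERLINK("url", ...) and captures the first quoted argument.
# Assumption: the hyperlink cells store this formula form.
HYPERLINK_RE = re.compile(r'^=HYPERLINK\(\s*"([^"]+)"', re.IGNORECASE)

def extract_url(cell_value):
    """Return the URL inside a HYPERLINK formula, or None if there is none."""
    if not isinstance(cell_value, str):
        return None
    match = HYPERLINK_RE.match(cell_value.strip())
    return match.group(1) if match else None

if __name__ == "__main__":
    # Hypothetical sample values for illustration only.
    print(extract_url('=HYPERLINK("http://news.windin.com/ns/example","title")'))
    print(extract_url('Announcement title'))
```

Returning None instead of raising lets the calling loop skip non-link cells without a try/except around every row.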
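Get the PDF download URL from the web page
Open the page for 读者传媒: 2017 Annual Report. In Chrome, press F12 to inspect the page source. An excerpt of the source follows: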
<div class="box4"><div style='float:left;width:40px;background-color:#ffffff;'> 附件:</div> <div style=float:left;width:660px;background-color:#f3f3f3;'> <a href=getatt.php?id=91785868&att_id=32276645 class='big' title=603999 读者传媒 2017 年年度报告.pdf>603999 读者传媒 2017 年年度报告.pdf </a>   (2.00M)  &nbsp</div></div>
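As the excerpt shows, the download link is the href attribute of the a tag, so it can be obtained by parsing the HTML source. BeautifulSoup is used here to parse the HTML.
Beautiful Soup is an HTML/XML parser written in Python. It copes well with malformed markup and builds a parse tree, and it provides simple, commonly used operations for navigating, searching, and modifying that tree, which can save a lot of programming time.
Install BeautifulSoup4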
pip install beautifulsoup4
Get the PDF download link and download the file
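import os
import urllib.request
import requests
from bs4 import BeautifulSoup

def downEachPdf(target, server):
    bf = BeautifulSoup(requests.get(url=target).text, features='lxml')
    os.makedirs('./report', exist_ok=True)  # urlretrieve does not create directories
    for each in bf.find_all('a'):
        href = each.get('href')
        # Keep only attachment links; the 'getatt.php' pattern comes from the excerpt above
        if not href or 'getatt.php' not in href or not each.string:
            continue
        url = server + href
        name = each.string.strip()
        print('downloading:', name, url)
        urllib.request.urlretrieve(url, './report/' + name)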
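The download function builds the URL by concatenating the server prefix with the href, and it uses the link text as the file name. As a hedged alternative, the sketch below resolves the relative link with urllib.parse.urljoin and replaces characters that are risky in file names. The helper names build_download_url and safe_filename, and the fallback name unnamed.pdf, are illustrative assumptions and not part of the original script.

```python
import re
from urllib.parse import urljoin

def build_download_url(server, href):
    """Resolve a relative href such as 'getatt.php?id=...' against the server prefix."""
    return urljoin(server, href)

def safe_filename(name, default="unnamed.pdf"):
    """Replace path separators, reserved characters and whitespace; fall back if empty."""
    cleaned = re.sub(r'[\\/:*?"<>|\s]+', "_", (name or "").strip()).strip("_")
    return cleaned or default

if __name__ == "__main__":
    server = "http://news.windin.com/ns/"
    print(build_download_url(server, "getatt.php?id=91785868&att_id=32276645"))
    print(safe_filename(" 603999 读者传媒 2017 年年度报告.pdf "))
```

urljoin also copes with hrefs that are already absolute, which plain string concatenation would turn into an invalid URL.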
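Repeated requests from the same IP are refused by the server
The approach above is already enough to batch-download the PDFs. In practice, however, if the same IP accesses a server too frequently, the requests get refused (they may be mistaken for a DoS attack; sites that apply rate limiting usually stop responding for a while, and you can catch the exception and sleep for some time). The download logic was therefore adjusted to use try/except. Files are normally downloaded in order; if the same file fails more than 10 times, it is skipped, the error is recorded, and the next file is downloaded.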
import os
import time
import urllib.request

def downloadXml(flag_exists, file_dir, file_name, xml_url, total_try=10):
    if not flag_exists:
        os.makedirs(file_dir, exist_ok=True)
    local = os.path.join(file_dir, file_name)
    for cur_try in range(total_try + 1):  # first attempt plus total_try retries
        try:
            urllib.request.urlretrieve(xml_url, local)
            return
        except Exception as e:
            print('attempt', cur_try + 1, 'failed:', e)
            last_error = e
            if cur_try < total_try:
                time.sleep(15)  # wait before retrying
    # Every attempt failed: log the URL, then let the caller decide to skip it.
    # The original wrote to an undefined test_dir; file_dir is used here instead.
    with open(os.path.join(file_dir, 'error_url.txt'), 'a') as f:
        f.write(xml_url + '\n')
    raise last_error
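To make the "skip and move on" behavior explicit, a driver loop might look like the sketch below. It is a minimal illustration under assumptions: download_all, the fetch callback, and the pause parameter are hypothetical names, and fetch stands in for whatever function downloads a single file (for example, a small wrapper around the retry function above).

```python
import time

def download_all(urls, fetch, pause=1.0, sleep=time.sleep):
    """Call fetch(url) for each URL; collect failures instead of stopping."""
    failed = []
    for i, url in enumerate(urls):
        try:
            fetch(url)
        except Exception as exc:
            # Record the failure and carry on with the next file.
            failed.append((url, str(exc)))
        if i < len(urls) - 1:
            sleep(pause)  # space out requests to reduce the chance of being rate-limited
    return failed

if __name__ == "__main__":
    def fake_fetch(url):  # stand-in downloader for demonstration only
        if "bad" in url:
            raise IOError("refused")

    print(download_all(["u1", "bad-url", "u2"], fake_fetch, pause=0))
```

Passing the sleep function as a parameter keeps the loop easy to test without real delays; the returned list can be written to an error file once the run finishes.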