Python web scraping: a performance comparison of re and lxml


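Preface

I had some spare time and remembered that the articles I once posted on Jianshu had not been moved over yet. I was about to move the one comparing lxml and re performance, only to find that the code was gone, so I simply rewrote it.

Results first:

Parsing and extracting data only: re is about 300% faster than lxml (roughly 4x the speed)!
Parsing plus pandas data processing: re finishes in nearly 40% less time than lxml.

There is not much to agonize over in this result. As expected, re should beat lxml, and both should beat beautifulsoup.

I rarely use beautifulsoup, so I did not test it this time.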

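Notes

To keep network fluctuations out of the measurement, the time spent on the network request should not be counted, so here the HTML to parse is passed in as a parameter.

Also, how well the parsing expression is written affects the result enormously, so in everyday work the effort should go into how the expressions are written.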
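To illustrate how much the wording of an expression can matter, here is a minimal sketch that extracts the same fields with two differently written regular expressions and times both. The SAMPLE markup and both patterns are my own simplified stand-ins, not the ones from the test script, and which pattern wins (and by how much) depends on the input, so measure on your own data.

```python
import re
import timeit

# Hypothetical miniature of a file-listing table (not the page used in this post).
SAMPLE = (
    "<tr><td><a href=a.pdf>a.pdf</a></td><td><div align=center>2021-11-23</div></td></tr>"
    "<tr><td><a href=b.doc>b.doc</a></td><td><div align=center>2021-11-24</div></td></tr>"
) * 200

# Lazy "match anything" groups.
LAZY = re.compile(
    r"<a href=([\s\S]*?)>([\s\S]*?)</a>[\s\S]*?<div align=center>([\s\S]*?)</div>"
)
# Negated character classes: each group stops at the first delimiter it meets.
TIGHT = re.compile(
    r"<a href=([^>]*)>([^<]*)</a></td><td><div align=center>([^<]*)</div>"
)

lazy_rows = LAZY.findall(SAMPLE)
tight_rows = TIGHT.findall(SAMPLE)

if __name__ == "__main__":
    # Both patterns must agree before their timings are worth comparing.
    print(lazy_rows == tight_rows, len(lazy_rows))
    for name, pattern in [("lazy", LAZY), ("tight", TIGHT)]:
        seconds = timeit.timeit(lambda: pattern.findall(SAMPLE), number=50)
        print(name, seconds)
```

Checking that both expressions return identical rows first matters: a faster pattern that silently drops or truncates fields is not an improvement.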

Tip

When exporting list data, just hand it straight to pandas. It saves time and effort.
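A minimal sketch of that tip, assuming the scraped rows are already a list of equal-length tuples. The sample rows and column names here are made up for illustration. Calling to_csv without a path returns the CSV as text; pass a file name instead to write a file.

```python
import pandas as pd

# Hypothetical scraped rows: (link, title, size), with stray whitespace in the links.
rows = [(" a.pdf", "A", "871.49 K"), ("b.doc ", "B", "139.66 K")]

frame = pd.DataFrame(rows, columns=["link", "title", "size"])
# Clean a whole column at once instead of looping over the rows.
frame["link"] = frame["link"].str.strip()

# index=False leaves out the automatic row-number column.
csv_text = frame.to_csv(index=False)

if __name__ == "__main__":
    print(csv_text)
```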

Code

# -*- encoding: utf-8 -*-
'''
@File    :   test-re-lxml.py
@Time    :   2021-12-18 22:33:10, Saturday
@Author  :   erma0
@Version :   1.0
@Link    :   https://erma0.cn
@Desc    :   Benchmark of re vs. lxml
'''

import re
import time
import pandas as pd
import requests
from lxml import etree
from itertools import zip_longest

# ahtml = requests.get('http://test.cn/').text
ahtml = '''<!DOCTYPE HTML PUBLIC"-//W3C//DTD HTML 4.01 Transitional//EN""http://www.w3.org/TR/html4/loose.dtd">
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=gb2312">
<title> 教育网盘 </title>
<style type="text/css">
<!--
td {font-size: 12px;}
a:link {text-decoration: none;}
body {font-size: 12px;}
a:visited {text-decoration: none;}
a:hover {
    color: #FF0000;
    text-decoration: underline;
}
-->
</style>
</head>
<body>
<table width="90%" border="0" align="center" class="list">
<tr bgcolor=#BFE6FD height="20">
  <td width="42" align=center> 图标 </td>
  <td width="381" align=center> 文件名 </td>
  <td width="98" align=center> 所属用户 </td>
  <td width="85" align=center> 大小 </td>
  <td width="132" align=center> 更新工夫 </td>
</tr>
<div class=main_content_2 id=content>
<!-- Copyright(C) 2005-2010  All Rights Reserved. -->
<tr bgcolor=#E8F0FF height=20><td width=25 align=center><img src=http://pan.test.com/filetype/pdf.gif></td><td><a href=homeapply.aspx?down=ok&filepath=HNAZ%2f%bc%bc%ca%f5%2f%bc%bc%ca%f5%bd%bb%b5%d7%2f1%c6%fb%b0%fc%b0%b2%d7%b0%ca%a9%b9%a4%bc%bc%ca%f5%bd%bb%b5%d7.pdf>1 汽包装置施工技术交底.pdf</a>  <font color=#999999></font></td><td width=120 align=center>HNAZ</td><td width=120>  <div lign=center>871.49 K</div></td><td width=120><div align=center>2021-11-23</div></td></tr>      <tr bgcolor=#E8F0FF height=20><td width=25 align=center><img src=http://pan.test.com/filetype/doc.gif></td><td><a href=homeapply.aspx?down=ok&filepath=HNAZ%2f%bc%bc%ca%f5%2f%bc%bc%ca%f5%bd%bb%b5%d7%2f%cb%ae%c0%e4%b1%da%b0%b2%d7%b0%ca%a9%b9%a4%bc%bc%ca%f5%bd%bb%b5%d7.doc> 水冷壁装置施工技术交底.doc</a><a href='http://pan.test.com/show.aspx?type=4&filepath=HNAZ%2f%bc%bc%ca%f5%2f%bc%bc%ca%f5%bd%bb%b5%d7%2f%2f%cb%ae%c0%e4%b1%da%b0%b2%d7%b0%ca%a9%b9%a4%bc%bc%ca%f5%bd%bb%b5%d7.doc' title='查看文件 水冷壁装置施工技术交底.doc' target=_blank><img src=http://pan.test.com/pics/edit.gif border=0></a>  <font color=#999999></font></td><td width=120 align=center>HNAZ</td><td width=120>  <div lign=center>139.66 K</div></td><td width=120><div align=center>2021-11-23</div></td></tr>      <tr bgcolor=#E8F0FF height=20><td width=25 align=center><img src=http://pan.test.com/filetype/doc.gif></td><td><a href=homeapply.aspx?down=ok&filepath=HNAZ%2f%bc%bc%ca%f5%2f%bc%bc%ca%f5%bd%bb%b5%d7%2f%cb%ae%d1%b9%ca%d4%d1%e9%bc%bc%ca%f5%bd%bb%b5%d7+-+%b8%b1%b1%be.doc> 水压试验技术交底 - 正本.doc</a><a href='http://pan.test.com/show.aspx?type=4&filepath=HNAZ%2f%bc%bc%ca%f5%2f%bc%bc%ca%f5%bd%bb%b5%d7%2f%2f%cb%ae%d1%b9%ca%d4%d1%e9%bc%bc%ca%f5%bd%bb%b5%d7+-+%b8%b1%b1%be.doc' title='查看文件 水压试验技术交底 - 正本.doc' target=_blank><img src=http://pan.test.com/pics/edit.gif border=0></a>  <font color=#999999></font></td><td width=120 align=center>HNAZ</td><td width=120>  <div lign=center>173.44 K</div></td><td width=120><div align=center>2021-11-23</div></td></tr>      <tr bgcolor=#E8F0FF 
height=20><td width=25 align=center><img src=http://pan.test.com/filetype/doc.gif></td><td><a href=homeapply.aspx?down=ok&filepath=HNAZ%2f%bc%bc%ca%f5%2f%bc%bc%ca%f5%bd%bb%b5%d7%2f%d1%cc%b5%c0%b0%b2%d7%b0%ca%a9%b9%a4%bc%bc%ca%f5%bd%bb%b5%d7+-+%b8%b1%b1%be.doc> 烟道装置施工技术交底 - 正本.doc</a><a href='http://pan.test.com/show.aspx?type=4&filepath=HNAZ%2f%bc%bc%ca%f5%2f%bc%bc%ca%f5%bd%bb%b5%d7%2f%2f%d1%cc%b5%c0%b0%b2%d7%b0%ca%a9%b9%a4%bc%bc%ca%f5%bd%bb%b5%d7+-+%b8%b1%b1%be.doc' title='查看文件 烟道装置施工技术交底 - 正本.doc' target=_blank><img src=http://pan.test.com/pics/edit.gif border=0></a>  <font color=#999999></font></td><td width=120 align=center>HNAZ</td><td width=120>  <div lign=center>130.84 K</div></td><td width=120><div align=center>2021-11-23</div></td></tr>      <tr bgcolor=#E8F0FF height=20><td width=25 align=center><img src=http://pan.test.com/filetype/doc.gif></td><td><a href=homeapply.aspx?down=ok&filepath=HNAZ%2f%bc%bc%ca%f5%2f%ca%a9%b9%a4%b7%bd%b0%b8%2f(%d7%ee%d6%d5%a3%a9%c9%bd%ce%f7%b6%ab%d2%e5%b8%c9%cf%a8%bd%b9%b9%a4%b3%cc%ca%a9%b9%a4%d7%e9%d6%af%c9%e8%bc%c60610.doc>( 最终)山西东义干熄焦工程施工组织设计 0610.doc</a><a href='http://pan.test.com/show.aspx?type=4&filepath=HNAZ%2f%bc%bc%ca%f5%2f%ca%a9%b9%a4%b7%bd%b0%b8%2f%2f(%d7%ee%d6%d5%a3%a9%c9%bd%ce%f7%b6%ab%d2%e5%b8%c9%cf%a8%bd%b9%b9%a4%b3%cc%ca%a9%b9%a4%d7%e9%d6%af%c9%e8%bc%c60610.doc' title='查看文件 ( 最终)山西东义干熄焦工程施工组织设计 0610.doc' target=_blank><img src=http://pan.test.com/pics/edit.gif border=0></a>  <font color=#999999></font></td><td width=120 align=center>HNAZ</td><td width=120>  <div lign=center>2.4 M</div></td><td width=120><div align=center>2021-11-23</div></td></tr>      <tr bgcolor=#E8F0FF height=20><td width=25 align=center><img src=http://pan.test.com/filetype/docx.gif></td><td><a href=homeapply.aspx?down=ok&filepath=HNAZ%2f%bc%bc%ca%f5%2f%ca%a9%b9%a4%b7%bd%b0%b8%2f%b6%ab%d2%e5%b8%c9%cf%a8%bd%b9%b9%f8%c2%af%cb%ae%d1%b9%ca%d4%d1%e9%b7%bd%b0%b8(1)10.10.docx> 东义干熄焦锅炉水压试验计划 (1)10.10.docx</a><a 
href='http://pan.test.com/show.aspx?type=4&filepath=HNAZ%2f%bc%bc%ca%f5%2f%ca%a9%b9%a4%b7%bd%b0%b8%2f%2f%b6%ab%d2%e5%b8%c9%cf%a8%bd%b9%b9%f8%c2%af%cb%ae%d1%b9%ca%d4%d1%e9%b7%bd%b0%b8(1)10.10.docx' title='查看文件 东义干熄焦锅炉水压试验计划 (1)10.10.docx' target=_blank><img src=http://pan.test.com/pics/edit.gif border=0></a>  <font color=#999999></font></td><td width=120 align=center>HNAZ</td><td width=120>  <div lign=center>103.62 K</div></td><td width=120><div align=center>2021-11-23</div></td></tr>      <tr bgcolor=#E8F0FF height=20><td width=25 align=center><img src=http://pan.test.com/filetype/doc.gif></td><td><a href=homeapply.aspx?down=ok&filepath=HNAZ%2f%bc%bc%ca%f5%2f%ca%a9%b9%a4%b7%bd%b0%b8%2f%c9%bd%ce%f7%b8%c9%cf%a8%bd%b9%b9%f8%c2%af%cb%ae%c0%e4%b1%da%b0%b2%d7%b0%b7%bd%b0%b80507.doc> 山西干熄焦锅炉水冷壁装置计划 0507.doc</a><a href='http://pan.test.com/show.aspx?type=4&filepath=HNAZ%2f%bc%bc%ca%f5%2f%ca%a9%b9%a4%b7%bd%b0%b8%2f%2f%c9%bd%ce%f7%b8%c9%cf%a8%bd%b9%b9%f8%c2%af%cb%ae%c0%e4%b1%da%b0%b2%d7%b0%b7%bd%b0%b80507.doc' title='查看文件 山西干熄焦锅炉水冷壁装置计划 0507.doc' target=_blank><img src=http://pan.test.com/pics/edit.gif border=0></a>  <font color=#999999></font></td><td width=120 align=center>HNAZ</td><td width=120>  <div lign=center>711.05 K</div></td><td width=120><div align=center>2021-11-23</div></td></tr>      <tr bgcolor=#E8F0FF height=20><td width=25 align=center><img src=http://pan.test.com/filetype/docx.gif></td><td><a href=homeapply.aspx?down=ok&filepath=HNAZ%2f%bc%bc%ca%f5%2f%ca%a9%b9%a4%b7%bd%b0%b8%2f%b6%ab%d2%e5%c6%fb%b0%fc%b5%f5%d7%b0%ca%a9%b9%a4%b7%bd%b0%b8+0710.docx> 东义汽包吊装施工计划 0710.docx</a><a href='http://pan.test.com/show.aspx?type=4&filepath=HNAZ%2f%bc%bc%ca%f5%2f%ca%a9%b9%a4%b7%bd%b0%b8%2f%2f%b6%ab%d2%e5%c6%fb%b0%fc%b5%f5%d7%b0%ca%a9%b9%a4%b7%bd%b0%b8+0710.docx' title='查看文件 东义汽包吊装施工计划 0710.docx' target=_blank><img src=http://pan.test.com/pics/edit.gif border=0></a>  <font color=#999999></font></td><td width=120 align=center>HNAZ</td><td width=120>  <div lign=center>989.66 
K</div></td><td width=120><div align=center>2021-11-23</div></td></tr>      <tr bgcolor=#E8F0FF height=20><td width=25 align=center><img src=http://pan.test.com/filetype/doc.gif></td><td><a href=homeapply.aspx?down=ok&filepath=HNAZ%2f%bc%bc%ca%f5%2f%ca%a9%b9%a4%b7%bd%b0%b8%2fPQR044(12Cr2MoG%a3%ac273%a1%c113)SMAW%2bGTAW.doc>PQR044(12Cr2MoG,273×13)SMAW+GT...</a><a href='http://pan.test.com/show.aspx?type=4&filepath=HNAZ%2f%bc%bc%ca%f5%2f%ca%a9%b9%a4%b7%bd%b0%b8%2f%2fPQR044(12Cr2MoG%a3%ac273%a1%c113)SMAW%2bGTAW.doc' title='查看文件 PQR044(12Cr2MoG,273×13)SMAW+GTAW.doc' target=_blank><img src=http://pan.test.com/pics/edit.gif border=0></a>  <font color=#999999></font></td><td width=120 align=center>HNAZ</td><td width=120>  <div lign=center>20.24 M</div></td><td width=120><div align=center>2021-11-23</div></td></tr>      <tr bgcolor=#E8F0FF height=20><td width=25 align=center><img src=http://pan.test.com/filetype/doc.gif></td><td><a href=homeapply.aspx?down=ok&filepath=HNAZ%2f%bc%bc%ca%f5%2f%ca%a9%b9%a4%b7%bd%b0%b8%2f%b9%f8%c2%af%ba%b8%bd%d3%ca%a9%b9%a4%b7%bd%b0%b80525.doc> 锅炉焊接施工计划 0525.doc</a><a href='http://pan.test.com/show.aspx?type=4&filepath=HNAZ%2f%bc%bc%ca%f5%2f%ca%a9%b9%a4%b7%bd%b0%b8%2f%2f%b9%f8%c2%af%ba%b8%bd%d3%ca%a9%b9%a4%b7%bd%b0%b80525.doc' title='查看文件 锅炉焊接施工计划 0525.doc' target=_blank><img src=http://pan.test.com/pics/edit.gif border=0></a>  <font color=#999999></font></td><td width=120 align=center>HNAZ</td><td width=120>  <div lign=center>711.5 K</div></td><td width=120><div align=center>2021-11-23</div></td></tr>      <tr bgcolor=#E8F0FF height=20><td width=25 align=center><img src=http://pan.test.com/filetype/doc.gif></td><td><a href=homeapply.aspx?down=ok&filepath=HNAZ%2f%bc%bc%ca%f5%2f%ca%a9%b9%a4%b7%bd%b0%b8%2f%c9%bd%ce%f7%b6%ab%d2%e5%b9%f8%c2%af%b8%d6%bc%dc%b0%b2%d7%b0%ca%a9%b9%a4%b7%bd%b0%b80520.doc> 山西东义锅炉钢架装置施工计划 0520.doc</a><a 
href='http://pan.test.com/show.aspx?type=4&filepath=HNAZ%2f%bc%bc%ca%f5%2f%ca%a9%b9%a4%b7%bd%b0%b8%2f%2f%c9%bd%ce%f7%b6%ab%d2%e5%b9%f8%c2%af%b8%d6%bc%dc%b0%b2%d7%b0%ca%a9%b9%a4%b7%bd%b0%b80520.doc' title='查看文件 山西东义锅炉钢架装置施工计划 0520.doc' target=_blank><img src=http://pan.test.com/pics/edit.gif border=0></a>  <font color=#999999></font></td><td width=120 align=center>HNAZ</td><td width=120>  <div lign=center>1.08 M</div></td><td width=120><div align=center>2021-11-23</div></td></tr>      <tr bgcolor=#E8F0FF height=20><td width=25 align=center><img src=http://pan.test.com/filetype/docx.gif></td><td><a href=homeapply.aspx?down=ok&filepath=HNAZ%2f%bc%bc%ca%f5%2f%ca%a9%b9%a4%b7%bd%b0%b8%2f%b6%ab%d2%e5230T%b8%c9%cf%a8%bd%b9%d3%e0%c8%c8%b9%f8%c2%af%b0%b2%d7%b0%b7%bd%b0%b8.docx> 东义 230T 干熄焦余热锅炉装置计划.docx</a><a href='http://pan.test.com/show.aspx?type=4&filepath=HNAZ%2f%bc%bc%ca%f5%2f%ca%a9%b9%a4%b7%bd%b0%b8%2f%2f%b6%ab%d2%e5230T%b8%c9%cf%a8%bd%b9%d3%e0%c8%c8%b9%f8%c2%af%b0%b2%d7%b0%b7%bd%b0%b8.docx' title='查看文件 东义 230T 干熄焦余热锅炉装置计划.docx' target=_blank><img src=http://pan.test.com/pics/edit.gif border=0></a>  <font color=#999999></font></td><td width=120 align=center>HNAZ</td><td width=120>  <div lign=center>418.04 K</div></td><td width=120><div align=center>2021-11-23</div></td></tr>      <tr bgcolor=#E8F0FF height=20><td width=25 align=center><img src=http://pan.test.com/filetype/docx.gif></td><td><a href=homeapply.aspx?down=ok&filepath=HNAZ%2f%bc%bc%ca%f5%2f%ca%a9%b9%a4%b7%bd%b0%b8%2f%b6%ac%bc%be%ca%a9%b9%a4%b7%bd%b0%b8+11.11.docx> 夏季施工计划 11.11.docx</a><a href='http://pan.test.com/show.aspx?type=4&filepath=HNAZ%2f%bc%bc%ca%f5%2f%ca%a9%b9%a4%b7%bd%b0%b8%2f%2f%b6%ac%bc%be%ca%a9%b9%a4%b7%bd%b0%b8+11.11.docx' title='查看文件 夏季施工计划 11.11.docx' target=_blank><img src=http://pan.test.com/pics/edit.gif border=0></a>  <font color=#999999></font></td><td width=120 align=center>HNAZ</td><td width=120>  <div lign=center>47.07 K</div></td><td width=120><div 
align=center>2021-11-23</div></td></tr>      <tr bgcolor=#E8F0FF height=20><td width=25 align=center><img src=http://pan.test.com/filetype/pdf.gif></td><td><a href=homeapply.aspx?down=ok&filepath=HNAZ%2f%b0%b2%c8%ab%2f%b0%b2%c8%ab%bd%bb%b5%d7%2f1%c6%fb%b0%fc%b0%b2%d7%b0%b0%b2%c8%ab%bc%bc%ca%f5%bd%bb%b5%d7.pdf>1 汽包装置平安技术交底.pdf</a>  <font color=#999999></font></td><td width=120 align=center>HNAZ</td><td width=120>  <div lign=center>908.46 K</div></td><td width=120><div align=center>2021-11-23</div></td></tr>      <tr bgcolor=#E8F0FF height=20><td width=25 align=center><img src=http://pan.test.com/filetype/docx.gif></td><td><a href=homeapply.aspx?down=ok&filepath=HNAZ%2f%b0%b2%c8%ab%2f%b0%b2%c8%ab%bd%bb%b5%d7%2f%d3%e0%c8%c8%b9%f8%c2%af%c8%eb%bf%da%d1%cc%b5%c0%b0%b2%d7%b0%b0%b2%c8%ab%bc%bc%ca%f5%bd%bb%b5%d7.docx> 余热锅炉入口烟道装置平安技术交底.docx</a><a href='http://pan.test.com/show.aspx?type=4&filepath=HNAZ%2f%b0%b2%c8%ab%2f%b0%b2%c8%ab%bd%bb%b5%d7%2f%2f%d3%e0%c8%c8%b9%f8%c2%af%c8%eb%bf%da%d1%cc%b5%c0%b0%b2%d7%b0%b0%b2%c8%ab%bc%bc%ca%f5%bd%bb%b5%d7.docx' title='查看文件 余热锅炉入口烟道装置平安技术交底.docx' target=_blank><img src=http://pan.test.com/pics/edit.gif border=0></a>  <font color=#999999></font></td><td width=120 align=center>HNAZ</td><td width=120>  <div lign=center>50.87 K</div></td><td width=120><div align=center>2021-11-23</div></td></tr>      <tr bgcolor=#E8F0FF height=20><td width=25 align=center><img src=http://pan.test.com/filetype/docx.gif></td><td><a href=homeapply.aspx?down=ok&filepath=HNAZ%2f%b0%b2%c8%ab%2f%b0%b2%c8%ab%bd%bb%b5%d7%2f%d3%e0%c8%c8%b9%f8%c2%af%b8%d6%bd%e1%b9%b9%b0%b2%c8%ab%bc%bc%ca%f5%bd%bb%b5%d7.docx> 余热锅炉钢结构平安技术交底.docx</a><a href='http://pan.test.com/show.aspx?type=4&filepath=HNAZ%2f%b0%b2%c8%ab%2f%b0%b2%c8%ab%bd%bb%b5%d7%2f%2f%d3%e0%c8%c8%b9%f8%c2%af%b8%d6%bd%e1%b9%b9%b0%b2%c8%ab%bc%bc%ca%f5%bd%bb%b5%d7.docx' title='查看文件 余热锅炉钢结构平安技术交底.docx' target=_blank><img src=http://pan.test.com/pics/edit.gif border=0></a>  <font color=#999999></font></td><td width=120 
align=center>HNAZ</td><td width=120>  <div lign=center>50.86 K</div></td><td width=120><div align=center>2021-11-23</div></td></tr>      <tr bgcolor=#E8F0FF height=20><td width=25 align=center><img src=http://pan.test.com/filetype/docx.gif></td><td><a href=homeapply.aspx?down=ok&filepath=HNAZ%2f%b0%b2%c8%ab%2f%b0%b2%c8%ab%bd%bb%b5%d7%2f%b5%f5%d7%b0%d6%b8%bb%d3%b0%b2%c8%ab%bd%bb%b5%d7.docx> 吊装指挥平安交底.docx</a><a href='http://pan.test.com/show.aspx?type=4&filepath=HNAZ%2f%b0%b2%c8%ab%2f%b0%b2%c8%ab%bd%bb%b5%d7%2f%2f%b5%f5%d7%b0%d6%b8%bb%d3%b0%b2%c8%ab%bd%bb%b5%d7.docx' title='查看文件 吊装指挥平安交底.docx' target=_blank><img src=http://pan.test.com/pics/edit.gif border=0></a>  <font color=#999999></font></td><td width=120 align=center>HNAZ</td><td width=120>  <div lign=center>13.59 K</div></td><td width=120><div align=center>2021-11-23</div></td></tr>      <tr bgcolor=#E8F0FF height=20><td width=25 align=center><img src=http://pan.test.com/filetype/docx.gif></td><td><a href=homeapply.aspx?down=ok&filepath=HNAZ%2f%b0%b2%c8%ab%2f%b0%b2%c8%ab%bd%bb%b5%d7%2f%cb%ae%d1%b9%ca%d4%d1%e9%b0%b2%c8%ab%bc%bc%ca%f5%bd%bb%b5%d7.docx> 水压试验平安技术交底.docx</a><a href='http://pan.test.com/show.aspx?type=4&filepath=HNAZ%2f%b0%b2%c8%ab%2f%b0%b2%c8%ab%bd%bb%b5%d7%2f%2f%cb%ae%d1%b9%ca%d4%d1%e9%b0%b2%c8%ab%bc%bc%ca%f5%bd%bb%b5%d7.docx' title='查看文件 水压试验平安技术交底.docx' target=_blank><img src=http://pan.test.com/pics/edit.gif border=0></a>  <font color=#999999></font></td><td width=120 align=center>HNAZ</td><td width=120>  <div lign=center>48.24 K</div></td><td width=120><div align=center>2021-11-23</div></td></tr>      <tr bgcolor=#E8F0FF height=20><td width=25 align=center><img src=http://pan.test.com/filetype/docx.gif></td><td><a href=homeapply.aspx?down=ok&filepath=HNAZ%2f%b0%b2%c8%ab%2f%b0%b2%c8%ab%bd%bb%b5%d7%2f%d2%c6%b6%af%bd%c5%ca%d6%bc%dc%b0%b2%c8%ab%bc%bc%ca%f5%bd%bb%b5%d7+-+10.31.docx> 挪动脚手架平安技术交底 - 10.31.docx</a><a 
href='http://pan.test.com/show.aspx?type=4&filepath=HNAZ%2f%b0%b2%c8%ab%2f%b0%b2%c8%ab%bd%bb%b5%d7%2f%2f%d2%c6%b6%af%bd%c5%ca%d6%bc%dc%b0%b2%c8%ab%bc%bc%ca%f5%bd%bb%b5%d7+-+10.31.docx' title='查看文件 挪动脚手架平安技术交底 - 10.31.docx' target=_blank><img src=http://pan.test.com/pics/edit.gif border=0></a>  <font color=#999999></font></td><td width=120 align=center>HNAZ</td><td width=120>  <div lign=center>45.41 K</div></td><td width=120><div align=center>2021-11-23</div></td></tr>      <tr bgcolor=#E8F0FF height=20><td width=25 align=center><img src=http://pan.test.com/filetype/docx.gif></td><td><a href=homeapply.aspx?down=ok&filepath=HNAZ%2f%b0%b2%c8%ab%2f%b0%b2%c8%ab%bd%bb%b5%d7%2f%d3%e0%c8%c8%b9%f8%c2%af%cb%ae%c0%e4%b1%da%b0%b2%c8%ab%bc%bc%ca%f5%bd%bb%b5%d7.docx> 余热锅炉水冷壁平安技术交底.docx</a><a href='http://pan.test.com/show.aspx?type=4&filepath=HNAZ%2f%b0%b2%c8%ab%2f%b0%b2%c8%ab%bd%bb%b5%d7%2f%2f%d3%e0%c8%c8%b9%f8%c2%af%cb%ae%c0%e4%b1%da%b0%b2%c8%ab%bc%bc%ca%f5%bd%bb%b5%d7.docx' title='查看文件 余热锅炉水冷壁平安技术交底.docx' target=_blank><img src=http://pan.test.com/pics/edit.gif border=0></a>  <font color=#999999></font></td><td width=120 align=center>HNAZ</td><td width=120>  <div lign=center>50.98 K</div></td><td width=120><div align=center>2021-11-23</div></td></tr>      </div>
<tr>
  <td colspan="5" align=center><div align="center"> 总共记录数:3276 页码:<a href=http://test.com/newshare.aspx?page=10 class=page>...</a>&nbsp;&nbsp;<a href=http://test.com/newshare.aspx?page=11 class=page>11</a>&nbsp;&nbsp;<a href=http://test.com/newshare.aspx?page=12 class=page>12</a>&nbsp;&nbsp;<a href=http://test.com/newshare.aspx?page=13 class=page>13</a>&nbsp;&nbsp;<a href=http://test.com/newshare.aspx?page=14 class=page>14</a>&nbsp;&nbsp;<a href=http://test.com/newshare.aspx?page=15 class=page>15</a>&nbsp;&nbsp;<span class=10>16</span>&nbsp;&nbsp;<a href=http://test.com/newshare.aspx?page=17 class=page>17</a>&nbsp;&nbsp;<a href=http://test.com/newshare.aspx?page=18 class=page>18</a>&nbsp;&nbsp;<a href=http://test.com/newshare.aspx?page=19 class=page>19</a>&nbsp;&nbsp;<a href=http://test.com/newshare.aspx?page=20 class=page>20</a>&nbsp;&nbsp;&nbsp;&nbsp;<a href=http://test.com/newshare.aspx?page=21 class=page>...</a><a href=http://test.com/newshare.aspx?page=1> 第一页 </a><a href=http://test.com/newshare.aspx?page=15> 上一页 </a><a href=http://test.com/newshare.aspx?page=17> 下一页 </a><a href=http://test.com/newshare.aspx?page=164> 最末页 </a> 第 16 页 / 共 164 页      </div></td>
  </tr>
</table>
</body>
</html>
'''


def get_lxml(html):
    """Parse the file table with lxml/XPath and export it through pandas."""
    d = etree.HTML(html)
    link = d.xpath('//tr/td[2]/a[1]/@href')
    title = d.xpath('//tr/td[2]/a[1]/text()')
    passwd = d.xpath('//tr/td[2]/font/text()')
    user = d.xpath('//tr/td[3][@width="120"]/text()')
    size = d.xpath('//tr/td[4]/div/text()')
    # Named "date" so the local variable does not shadow the imported time module.
    date = d.xpath('//tr/td[5]/div/text()')

    # zip_longest pads shorter columns with '' and pairs values by position.
    datas = list(zip_longest(link, title, passwd, user, size, date, fillvalue=''))
    datas = pd.DataFrame(datas, columns=['link', 'title', 'passwd', 'user', 'size', 'time'])
    datas['link'] = datas['link'].str.strip()
    datas.to_csv('result-lxml.csv')
    return datas


def get_re(html):
    """Extract the same six fields with one regex and export them through pandas."""
    # rep is the module-level pattern assigned in the __main__ block below.
    datas = re.findall(rep, html, re.S)
    datas = pd.DataFrame(datas, columns=['link', 'title', 'passwd', 'user', 'size', 'time'])
    datas['link'] = datas['link'].str.strip()
    datas.to_csv('result-re.csv')
    return datas


if __name__ == '__main__':
    rep = r"<td><a href=([\s\S]*?)>([\s\S]*?)</a>[\s\S]*?<font color=#999999>([\s\S]*?)</font>[\s\S]*?center>([\s\S]*?)</td><td width=120>  <div lign=center>([\s\S]*?)</div></td><td width=120><div align=center>([\s\S]*?)</div>"
    r = re.compile(rep, re.S)
    for name, function in [('lxml', get_lxml), ('re', get_re)]:
        start = time.time()
        for i in range(500):
            function(ahtml)
        # function(ahtml)
        end = time.time()
        print(name, end - start)
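The loop above takes a single time.time() measurement per parser. As a hedged alternative (not what produced the numbers in this post), the standard-library timeit module can repeat the whole run several times and keep the best one, which usually dampens noise from other processes. The bench helper and its default arguments are my own naming.

```python
import timeit


def bench(func, arg, number=500, repeat=5):
    """Return the best total time in seconds for `number` calls of func(arg)."""
    timer = timeit.Timer(lambda: func(arg))
    # repeat() returns one total per round; the minimum is the least disturbed round.
    return min(timer.repeat(repeat=repeat, number=number))


if __name__ == "__main__":
    # Stand-in workload; swap in get_lxml / get_re and ahtml to benchmark the parsers.
    print(bench(str.split, "a b c " * 1000, number=200))
```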

Results

lxml 2.1219968795776367
re 1.341965675354004

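re took nearly 40% less time than lxml (about 37%, or roughly 1.6x the speed).

Now test pure parsing performance

The code above also counts the time pandas spends processing the data, so below I comment that part out and test again. The code is as follows: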

def get_lxml(html):
    d = etree.HTML(html)
    link = d.xpath('//tr/td[2]/a[1]/@href')
    title = d.xpath('//tr/td[2]/a[1]/text()')
    passwd = d.xpath('//tr/td[2]/font/text()')
    user = d.xpath('//tr/td[3][@width="120"]/text()')
    size = d.xpath('//tr/td[4]/div/text()')
    time = d.xpath('//tr/td[5]/div/text()')

def get_re(html):
    # datas= r.findall(html)
    datas = re.findall(rep, html, re.S)

Results 2

lxml 0.6280360221862793
re 0.15498137474060059

For parsing and extracting data alone, re is about 300% faster than lxml (roughly 4x the speed)!
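The percentages in this post can be recomputed from the printed timings. A small sketch, with a helper name of my own choosing, that makes the difference between "percent faster" and "percent less time" explicit:

```python
def speedup(slow, fast):
    """Return (ratio, percent_faster, percent_less_time) for two wall-clock times."""
    ratio = slow / fast
    percent_faster = (ratio - 1) * 100
    percent_less_time = (1 - fast / slow) * 100
    return ratio, percent_faster, percent_less_time


# Timings printed by the two runs in this post: (lxml, re), in seconds.
with_pandas = speedup(2.1219968795776367, 1.341965675354004)
parse_only = speedup(0.6280360221862793, 0.15498137474060059)

if __name__ == "__main__":
    print("with pandas:", with_pandas)
    print("parse only: ", parse_only)
```

With pandas included, re is about 1.6x the speed of lxml, which is the same as taking about 37% less time. For parsing alone the ratio is about 4x, which is where the 300% figure comes from.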
