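Preface
With some free time on my hands, I remembered that the articles I had posted on Jianshu (简书) still hadn't been migrated. I was about to move over the one comparing the speed of lxml and re, only to find that the code was gone, so I simply rewrote it.
Results first:
Parsing and extracting data only: re is about 300% faster than lxml (roughly 4x as fast)!
Parsing plus processing the data with pandas: re takes roughly 40% less time than lxml.
There is really nothing surprising about this outcome. The expected ranking is re ahead of lxml, and both ahead of beautifulsoup.
I rarely use beautifulsoup, so it was not tested this time.
Notes
To keep network fluctuations from skewing the results, the time spent on the network request should not be counted; here the HTML to be parsed is passed in as an argument.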
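A minimal sketch of that idea, using only the standard library. The names `SAMPLE_HTML`, `LINK_RE`, `parse_links` and `time_parser` are hypothetical and not part of the benchmark script below; the point is only that the parser receives an HTML string that was obtained ahead of time, so the timed loop contains no request.

```python
import re
import time

# Hypothetical stand-in for a page that was fetched once, before timing starts.
SAMPLE_HTML = "<table>" + "<tr><td><a href=/f/1.pdf>1.pdf</a></td></tr>" * 20 + "</table>"

# Compiled once, outside the timed loop.
LINK_RE = re.compile(r"<a href=([^>]*)>([^<]*)</a>")


def parse_links(html):
    """Parse only: the HTML arrives as an argument, so no request is timed."""
    return LINK_RE.findall(html)


def time_parser(parser, html, repeat=500):
    """Return the total seconds spent calling parser(html) `repeat` times."""
    start = time.perf_counter()
    for _ in range(repeat):
        parser(html)
    return time.perf_counter() - start


elapsed = time_parser(parse_links, SAMPLE_HTML)
print(f"parse_links: {elapsed:.4f}s for 500 runs")
```

`time.perf_counter()` is used here because it is intended for measuring short intervals; the original script uses `time.time()`, which is also fine at this scale.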
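Also, how well the parsing expressions are written has a very large effect on the results, so in general most of the effort should go into writing the expressions.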
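As an illustration, here are two regular expressions that extract the same fields from the same input, one built from lazy `[\s\S]*?` groups and one from negated character classes. The sample rows and the names `ROW`, `HTML`, `LAZY` and `TIGHT` are made up for this sketch; which pattern is faster can depend on the input and the Python version, so it is worth measuring rather than assuming.

```python
import re
import timeit

# Hypothetical rows, loosely modelled on a file-listing table.
ROW = "<tr><td><a href=/f/{n}.doc>file{n}.doc</a></td><td><div align=center>2021-11-23</div></td></tr>"
HTML = "".join(ROW.format(n=n) for n in range(50))

# Lazy "match anything" groups, in the style of the benchmark's pattern.
LAZY = re.compile(
    r"<a href=([\s\S]*?)>([\s\S]*?)</a>[\s\S]*?<div align=center>([\s\S]*?)</div>"
)

# Negated character classes: each group stops at the first delimiter it meets.
TIGHT = re.compile(
    r"<a href=([^>]*)>([^<]*)</a>.*?<div align=center>([^<]*)</div>", re.S
)

lazy_rows = LAZY.findall(HTML)
tight_rows = TIGHT.findall(HTML)

# Both expressions must return the same data before their speed is compared.
assert lazy_rows == tight_rows

for name, pattern in [("lazy", LAZY), ("tight", TIGHT)]:
    seconds = timeit.timeit(lambda: pattern.findall(HTML), number=200)
    print(name, seconds)
```

Checking that both expressions return identical rows first keeps the comparison honest: a faster pattern that silently drops fields is not an improvement.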
Tip
When exporting data held in a list, just hand it straight to pandas; it saves time and effort.
Code
# -*- encoding: utf-8 -*-
'''
@File : test-re-lxml.py
@Time : 2021-12-18 22:33:10, Saturday
@Author : erma0
@Version : 1.0
@Link : https://erma0.cn
@Desc : Benchmark of re vs. lxml parsing speed
'''
import re
import time
import pandas as pd
import requests  # only needed for the commented-out live fetch below
from lxml import etree
from itertools import zip_longest

# ahtml = requests.get('http://test.cn/').text
ahtml = '''<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd"><html><head><meta http-equiv="Content-Type" content="text/html; charset=gb2312"><title>教育网盘</title><style type="text/css"><!--td { font-size: 12px;}a:link { text-decoration: none;}body { font-size: 12px;}a:visited { text-decoration: none;}a:hover { color: #FF0000; text-decoration: underline;}--></style></head><body><table width="90%" border="0" align="center" class="list"><tr bgcolor=#BFE6FD height="20"> <td width="42" align=center>图标</td> <td width="381" align=center>文件名</td> <td width="98" align=center>所属用户</td> <td width="85" align=center>大小</td> <td width="132" align=center>更新工夫</td></tr><div class=main_content_2 id=content><!-- Copyright(C) 2005-2010 All Rights Reserved. 
--><tr bgcolor=#E8F0FF height=20><td width=25 align=center><img src=http://pan.test.com/filetype/pdf.gif></td><td><a href=homeapply.aspx?down=ok&filepath=HNAZ%2f%bc%bc%ca%f5%2f%bc%bc%ca%f5%bd%bb%b5%d7%2f1%c6%fb%b0%fc%b0%b2%d7%b0%ca%a9%b9%a4%bc%bc%ca%f5%bd%bb%b5%d7.pdf>1汽包装置施工技术交底.pdf</a> <font color=#999999></font></td><td width=120 align=center>HNAZ</td><td width=120> <div lign=center>871.49 K</div></td><td width=120><div align=center>2021-11-23</div></td></tr> <tr bgcolor=#E8F0FF height=20><td width=25 align=center><img src=http://pan.test.com/filetype/doc.gif></td><td><a href=homeapply.aspx?down=ok&filepath=HNAZ%2f%bc%bc%ca%f5%2f%bc%bc%ca%f5%bd%bb%b5%d7%2f%cb%ae%c0%e4%b1%da%b0%b2%d7%b0%ca%a9%b9%a4%bc%bc%ca%f5%bd%bb%b5%d7.doc>水冷壁装置施工技术交底.doc</a><a href='http://pan.test.com/show.aspx?type=4&filepath=HNAZ%2f%bc%bc%ca%f5%2f%bc%bc%ca%f5%bd%bb%b5%d7%2f%2f%cb%ae%c0%e4%b1%da%b0%b2%d7%b0%ca%a9%b9%a4%bc%bc%ca%f5%bd%bb%b5%d7.doc' title='查看文件 水冷壁装置施工技术交底.doc ' target=_blank><img src=http://pan.test.com/pics/edit.gif border=0></a> <font color=#999999></font></td><td width=120 align=center>HNAZ</td><td width=120> <div lign=center>139.66 K</div></td><td width=120><div align=center>2021-11-23</div></td></tr> <tr bgcolor=#E8F0FF height=20><td width=25 align=center><img src=http://pan.test.com/filetype/doc.gif></td><td><a href=homeapply.aspx?down=ok&filepath=HNAZ%2f%bc%bc%ca%f5%2f%bc%bc%ca%f5%bd%bb%b5%d7%2f%cb%ae%d1%b9%ca%d4%d1%e9%bc%bc%ca%f5%bd%bb%b5%d7+-+%b8%b1%b1%be.doc>水压试验技术交底 - 正本.doc</a><a href='http://pan.test.com/show.aspx?type=4&filepath=HNAZ%2f%bc%bc%ca%f5%2f%bc%bc%ca%f5%bd%bb%b5%d7%2f%2f%cb%ae%d1%b9%ca%d4%d1%e9%bc%bc%ca%f5%bd%bb%b5%d7+-+%b8%b1%b1%be.doc' title='查看文件 水压试验技术交底 - 正本.doc ' target=_blank><img src=http://pan.test.com/pics/edit.gif border=0></a> <font color=#999999></font></td><td width=120 align=center>HNAZ</td><td width=120> <div lign=center>173.44 K</div></td><td width=120><div align=center>2021-11-23</div></td></tr> <tr bgcolor=#E8F0FF height=20><td 
width=25 align=center><img src=http://pan.test.com/filetype/doc.gif></td><td><a href=homeapply.aspx?down=ok&filepath=HNAZ%2f%bc%bc%ca%f5%2f%bc%bc%ca%f5%bd%bb%b5%d7%2f%d1%cc%b5%c0%b0%b2%d7%b0%ca%a9%b9%a4%bc%bc%ca%f5%bd%bb%b5%d7+-+%b8%b1%b1%be.doc>烟道装置施工技术交底 - 正本.doc</a><a href='http://pan.test.com/show.aspx?type=4&filepath=HNAZ%2f%bc%bc%ca%f5%2f%bc%bc%ca%f5%bd%bb%b5%d7%2f%2f%d1%cc%b5%c0%b0%b2%d7%b0%ca%a9%b9%a4%bc%bc%ca%f5%bd%bb%b5%d7+-+%b8%b1%b1%be.doc' title='查看文件 烟道装置施工技术交底 - 正本.doc ' target=_blank><img src=http://pan.test.com/pics/edit.gif border=0></a> <font color=#999999></font></td><td width=120 align=center>HNAZ</td><td width=120> <div lign=center>130.84 K</div></td><td width=120><div align=center>2021-11-23</div></td></tr> <tr bgcolor=#E8F0FF height=20><td width=25 align=center><img src=http://pan.test.com/filetype/doc.gif></td><td><a href=homeapply.aspx?down=ok&filepath=HNAZ%2f%bc%bc%ca%f5%2f%ca%a9%b9%a4%b7%bd%b0%b8%2f(%d7%ee%d6%d5%a3%a9%c9%bd%ce%f7%b6%ab%d2%e5%b8%c9%cf%a8%bd%b9%b9%a4%b3%cc%ca%a9%b9%a4%d7%e9%d6%af%c9%e8%bc%c60610.doc>(最终)山西东义干熄焦工程施工组织设计0610.doc</a><a href='http://pan.test.com/show.aspx?type=4&filepath=HNAZ%2f%bc%bc%ca%f5%2f%ca%a9%b9%a4%b7%bd%b0%b8%2f%2f(%d7%ee%d6%d5%a3%a9%c9%bd%ce%f7%b6%ab%d2%e5%b8%c9%cf%a8%bd%b9%b9%a4%b3%cc%ca%a9%b9%a4%d7%e9%d6%af%c9%e8%bc%c60610.doc' title='查看文件 (最终)山西东义干熄焦工程施工组织设计0610.doc ' target=_blank><img src=http://pan.test.com/pics/edit.gif border=0></a> <font color=#999999></font></td><td width=120 align=center>HNAZ</td><td width=120> <div lign=center>2.4 M</div></td><td width=120><div align=center>2021-11-23</div></td></tr> <tr bgcolor=#E8F0FF height=20><td width=25 align=center><img src=http://pan.test.com/filetype/docx.gif></td><td><a href=homeapply.aspx?down=ok&filepath=HNAZ%2f%bc%bc%ca%f5%2f%ca%a9%b9%a4%b7%bd%b0%b8%2f%b6%ab%d2%e5%b8%c9%cf%a8%bd%b9%b9%f8%c2%af%cb%ae%d1%b9%ca%d4%d1%e9%b7%bd%b0%b8(1)10.10.docx>东义干熄焦锅炉水压试验计划(1)10.10.docx</a><a 
href='http://pan.test.com/show.aspx?type=4&filepath=HNAZ%2f%bc%bc%ca%f5%2f%ca%a9%b9%a4%b7%bd%b0%b8%2f%2f%b6%ab%d2%e5%b8%c9%cf%a8%bd%b9%b9%f8%c2%af%cb%ae%d1%b9%ca%d4%d1%e9%b7%bd%b0%b8(1)10.10.docx' title='查看文件 东义干熄焦锅炉水压试验计划(1)10.10.docx ' target=_blank><img src=http://pan.test.com/pics/edit.gif border=0></a> <font color=#999999></font></td><td width=120 align=center>HNAZ</td><td width=120> <div lign=center>103.62 K</div></td><td width=120><div align=center>2021-11-23</div></td></tr> <tr bgcolor=#E8F0FF height=20><td width=25 align=center><img src=http://pan.test.com/filetype/doc.gif></td><td><a href=homeapply.aspx?down=ok&filepath=HNAZ%2f%bc%bc%ca%f5%2f%ca%a9%b9%a4%b7%bd%b0%b8%2f%c9%bd%ce%f7%b8%c9%cf%a8%bd%b9%b9%f8%c2%af%cb%ae%c0%e4%b1%da%b0%b2%d7%b0%b7%bd%b0%b80507.doc>山西干熄焦锅炉水冷壁装置计划0507.doc</a><a href='http://pan.test.com/show.aspx?type=4&filepath=HNAZ%2f%bc%bc%ca%f5%2f%ca%a9%b9%a4%b7%bd%b0%b8%2f%2f%c9%bd%ce%f7%b8%c9%cf%a8%bd%b9%b9%f8%c2%af%cb%ae%c0%e4%b1%da%b0%b2%d7%b0%b7%bd%b0%b80507.doc' title='查看文件 山西干熄焦锅炉水冷壁装置计划0507.doc ' target=_blank><img src=http://pan.test.com/pics/edit.gif border=0></a> <font color=#999999></font></td><td width=120 align=center>HNAZ</td><td width=120> <div lign=center>711.05 K</div></td><td width=120><div align=center>2021-11-23</div></td></tr> <tr bgcolor=#E8F0FF height=20><td width=25 align=center><img src=http://pan.test.com/filetype/docx.gif></td><td><a href=homeapply.aspx?down=ok&filepath=HNAZ%2f%bc%bc%ca%f5%2f%ca%a9%b9%a4%b7%bd%b0%b8%2f%b6%ab%d2%e5%c6%fb%b0%fc%b5%f5%d7%b0%ca%a9%b9%a4%b7%bd%b0%b8+0710.docx>东义汽包吊装施工计划 0710.docx</a><a href='http://pan.test.com/show.aspx?type=4&filepath=HNAZ%2f%bc%bc%ca%f5%2f%ca%a9%b9%a4%b7%bd%b0%b8%2f%2f%b6%ab%d2%e5%c6%fb%b0%fc%b5%f5%d7%b0%ca%a9%b9%a4%b7%bd%b0%b8+0710.docx' title='查看文件 东义汽包吊装施工计划 0710.docx ' target=_blank><img src=http://pan.test.com/pics/edit.gif border=0></a> <font color=#999999></font></td><td width=120 align=center>HNAZ</td><td width=120> <div lign=center>989.66 K</div></td><td 
width=120><div align=center>2021-11-23</div></td></tr> <tr bgcolor=#E8F0FF height=20><td width=25 align=center><img src=http://pan.test.com/filetype/doc.gif></td><td><a href=homeapply.aspx?down=ok&filepath=HNAZ%2f%bc%bc%ca%f5%2f%ca%a9%b9%a4%b7%bd%b0%b8%2fPQR044(12Cr2MoG%a3%ac273%a1%c113)SMAW%2bGTAW.doc>PQR044(12Cr2MoG,273×13)SMAW+GT...</a><a href='http://pan.test.com/show.aspx?type=4&filepath=HNAZ%2f%bc%bc%ca%f5%2f%ca%a9%b9%a4%b7%bd%b0%b8%2f%2fPQR044(12Cr2MoG%a3%ac273%a1%c113)SMAW%2bGTAW.doc' title='查看文件 PQR044(12Cr2MoG,273×13)SMAW+GTAW.doc ' target=_blank><img src=http://pan.test.com/pics/edit.gif border=0></a> <font color=#999999></font></td><td width=120 align=center>HNAZ</td><td width=120> <div lign=center>20.24 M</div></td><td width=120><div align=center>2021-11-23</div></td></tr> <tr bgcolor=#E8F0FF height=20><td width=25 align=center><img src=http://pan.test.com/filetype/doc.gif></td><td><a href=homeapply.aspx?down=ok&filepath=HNAZ%2f%bc%bc%ca%f5%2f%ca%a9%b9%a4%b7%bd%b0%b8%2f%b9%f8%c2%af%ba%b8%bd%d3%ca%a9%b9%a4%b7%bd%b0%b80525.doc>锅炉焊接施工计划0525.doc</a><a href='http://pan.test.com/show.aspx?type=4&filepath=HNAZ%2f%bc%bc%ca%f5%2f%ca%a9%b9%a4%b7%bd%b0%b8%2f%2f%b9%f8%c2%af%ba%b8%bd%d3%ca%a9%b9%a4%b7%bd%b0%b80525.doc' title='查看文件 锅炉焊接施工计划0525.doc ' target=_blank><img src=http://pan.test.com/pics/edit.gif border=0></a> <font color=#999999></font></td><td width=120 align=center>HNAZ</td><td width=120> <div lign=center>711.5 K</div></td><td width=120><div align=center>2021-11-23</div></td></tr> <tr bgcolor=#E8F0FF height=20><td width=25 align=center><img src=http://pan.test.com/filetype/doc.gif></td><td><a href=homeapply.aspx?down=ok&filepath=HNAZ%2f%bc%bc%ca%f5%2f%ca%a9%b9%a4%b7%bd%b0%b8%2f%c9%bd%ce%f7%b6%ab%d2%e5%b9%f8%c2%af%b8%d6%bc%dc%b0%b2%d7%b0%ca%a9%b9%a4%b7%bd%b0%b80520.doc>山西东义锅炉钢架装置施工计划0520.doc</a><a 
href='http://pan.test.com/show.aspx?type=4&filepath=HNAZ%2f%bc%bc%ca%f5%2f%ca%a9%b9%a4%b7%bd%b0%b8%2f%2f%c9%bd%ce%f7%b6%ab%d2%e5%b9%f8%c2%af%b8%d6%bc%dc%b0%b2%d7%b0%ca%a9%b9%a4%b7%bd%b0%b80520.doc' title='查看文件 山西东义锅炉钢架装置施工计划0520.doc ' target=_blank><img src=http://pan.test.com/pics/edit.gif border=0></a> <font color=#999999></font></td><td width=120 align=center>HNAZ</td><td width=120> <div lign=center>1.08 M</div></td><td width=120><div align=center>2021-11-23</div></td></tr> <tr bgcolor=#E8F0FF height=20><td width=25 align=center><img src=http://pan.test.com/filetype/docx.gif></td><td><a href=homeapply.aspx?down=ok&filepath=HNAZ%2f%bc%bc%ca%f5%2f%ca%a9%b9%a4%b7%bd%b0%b8%2f%b6%ab%d2%e5230T%b8%c9%cf%a8%bd%b9%d3%e0%c8%c8%b9%f8%c2%af%b0%b2%d7%b0%b7%bd%b0%b8.docx>东义230T干熄焦余热锅炉装置计划.docx</a><a href='http://pan.test.com/show.aspx?type=4&filepath=HNAZ%2f%bc%bc%ca%f5%2f%ca%a9%b9%a4%b7%bd%b0%b8%2f%2f%b6%ab%d2%e5230T%b8%c9%cf%a8%bd%b9%d3%e0%c8%c8%b9%f8%c2%af%b0%b2%d7%b0%b7%bd%b0%b8.docx' title='查看文件 东义230T干熄焦余热锅炉装置计划.docx ' target=_blank><img src=http://pan.test.com/pics/edit.gif border=0></a> <font color=#999999></font></td><td width=120 align=center>HNAZ</td><td width=120> <div lign=center>418.04 K</div></td><td width=120><div align=center>2021-11-23</div></td></tr> <tr bgcolor=#E8F0FF height=20><td width=25 align=center><img src=http://pan.test.com/filetype/docx.gif></td><td><a href=homeapply.aspx?down=ok&filepath=HNAZ%2f%bc%bc%ca%f5%2f%ca%a9%b9%a4%b7%bd%b0%b8%2f%b6%ac%bc%be%ca%a9%b9%a4%b7%bd%b0%b8+11.11.docx>夏季施工计划 11.11.docx</a><a href='http://pan.test.com/show.aspx?type=4&filepath=HNAZ%2f%bc%bc%ca%f5%2f%ca%a9%b9%a4%b7%bd%b0%b8%2f%2f%b6%ac%bc%be%ca%a9%b9%a4%b7%bd%b0%b8+11.11.docx' title='查看文件 夏季施工计划 11.11.docx ' target=_blank><img src=http://pan.test.com/pics/edit.gif border=0></a> <font color=#999999></font></td><td width=120 align=center>HNAZ</td><td width=120> <div lign=center>47.07 K</div></td><td width=120><div align=center>2021-11-23</div></td></tr> <tr 
bgcolor=#E8F0FF height=20><td width=25 align=center><img src=http://pan.test.com/filetype/pdf.gif></td><td><a href=homeapply.aspx?down=ok&filepath=HNAZ%2f%b0%b2%c8%ab%2f%b0%b2%c8%ab%bd%bb%b5%d7%2f1%c6%fb%b0%fc%b0%b2%d7%b0%b0%b2%c8%ab%bc%bc%ca%f5%bd%bb%b5%d7.pdf>1汽包装置平安技术交底.pdf</a> <font color=#999999></font></td><td width=120 align=center>HNAZ</td><td width=120> <div lign=center>908.46 K</div></td><td width=120><div align=center>2021-11-23</div></td></tr> <tr bgcolor=#E8F0FF height=20><td width=25 align=center><img src=http://pan.test.com/filetype/docx.gif></td><td><a href=homeapply.aspx?down=ok&filepath=HNAZ%2f%b0%b2%c8%ab%2f%b0%b2%c8%ab%bd%bb%b5%d7%2f%d3%e0%c8%c8%b9%f8%c2%af%c8%eb%bf%da%d1%cc%b5%c0%b0%b2%d7%b0%b0%b2%c8%ab%bc%bc%ca%f5%bd%bb%b5%d7.docx>余热锅炉入口烟道装置平安技术交底.docx</a><a href='http://pan.test.com/show.aspx?type=4&filepath=HNAZ%2f%b0%b2%c8%ab%2f%b0%b2%c8%ab%bd%bb%b5%d7%2f%2f%d3%e0%c8%c8%b9%f8%c2%af%c8%eb%bf%da%d1%cc%b5%c0%b0%b2%d7%b0%b0%b2%c8%ab%bc%bc%ca%f5%bd%bb%b5%d7.docx' title='查看文件 余热锅炉入口烟道装置平安技术交底.docx ' target=_blank><img src=http://pan.test.com/pics/edit.gif border=0></a> <font color=#999999></font></td><td width=120 align=center>HNAZ</td><td width=120> <div lign=center>50.87 K</div></td><td width=120><div align=center>2021-11-23</div></td></tr> <tr bgcolor=#E8F0FF height=20><td width=25 align=center><img src=http://pan.test.com/filetype/docx.gif></td><td><a href=homeapply.aspx?down=ok&filepath=HNAZ%2f%b0%b2%c8%ab%2f%b0%b2%c8%ab%bd%bb%b5%d7%2f%d3%e0%c8%c8%b9%f8%c2%af%b8%d6%bd%e1%b9%b9%b0%b2%c8%ab%bc%bc%ca%f5%bd%bb%b5%d7.docx>余热锅炉钢结构平安技术交底.docx</a><a href='http://pan.test.com/show.aspx?type=4&filepath=HNAZ%2f%b0%b2%c8%ab%2f%b0%b2%c8%ab%bd%bb%b5%d7%2f%2f%d3%e0%c8%c8%b9%f8%c2%af%b8%d6%bd%e1%b9%b9%b0%b2%c8%ab%bc%bc%ca%f5%bd%bb%b5%d7.docx' title='查看文件 余热锅炉钢结构平安技术交底.docx ' target=_blank><img src=http://pan.test.com/pics/edit.gif border=0></a> <font color=#999999></font></td><td width=120 align=center>HNAZ</td><td width=120> <div lign=center>50.86 
K</div></td><td width=120><div align=center>2021-11-23</div></td></tr> <tr bgcolor=#E8F0FF height=20><td width=25 align=center><img src=http://pan.test.com/filetype/docx.gif></td><td><a href=homeapply.aspx?down=ok&filepath=HNAZ%2f%b0%b2%c8%ab%2f%b0%b2%c8%ab%bd%bb%b5%d7%2f%b5%f5%d7%b0%d6%b8%bb%d3%b0%b2%c8%ab%bd%bb%b5%d7.docx>吊装指挥平安交底.docx</a><a href='http://pan.test.com/show.aspx?type=4&filepath=HNAZ%2f%b0%b2%c8%ab%2f%b0%b2%c8%ab%bd%bb%b5%d7%2f%2f%b5%f5%d7%b0%d6%b8%bb%d3%b0%b2%c8%ab%bd%bb%b5%d7.docx' title='查看文件 吊装指挥平安交底.docx ' target=_blank><img src=http://pan.test.com/pics/edit.gif border=0></a> <font color=#999999></font></td><td width=120 align=center>HNAZ</td><td width=120> <div lign=center>13.59 K</div></td><td width=120><div align=center>2021-11-23</div></td></tr> <tr bgcolor=#E8F0FF height=20><td width=25 align=center><img src=http://pan.test.com/filetype/docx.gif></td><td><a href=homeapply.aspx?down=ok&filepath=HNAZ%2f%b0%b2%c8%ab%2f%b0%b2%c8%ab%bd%bb%b5%d7%2f%cb%ae%d1%b9%ca%d4%d1%e9%b0%b2%c8%ab%bc%bc%ca%f5%bd%bb%b5%d7.docx>水压试验平安技术交底.docx</a><a href='http://pan.test.com/show.aspx?type=4&filepath=HNAZ%2f%b0%b2%c8%ab%2f%b0%b2%c8%ab%bd%bb%b5%d7%2f%2f%cb%ae%d1%b9%ca%d4%d1%e9%b0%b2%c8%ab%bc%bc%ca%f5%bd%bb%b5%d7.docx' title='查看文件 水压试验平安技术交底.docx ' target=_blank><img src=http://pan.test.com/pics/edit.gif border=0></a> <font color=#999999></font></td><td width=120 align=center>HNAZ</td><td width=120> <div lign=center>48.24 K</div></td><td width=120><div align=center>2021-11-23</div></td></tr> <tr bgcolor=#E8F0FF height=20><td width=25 align=center><img src=http://pan.test.com/filetype/docx.gif></td><td><a href=homeapply.aspx?down=ok&filepath=HNAZ%2f%b0%b2%c8%ab%2f%b0%b2%c8%ab%bd%bb%b5%d7%2f%d2%c6%b6%af%bd%c5%ca%d6%bc%dc%b0%b2%c8%ab%bc%bc%ca%f5%bd%bb%b5%d7+-+10.31.docx>挪动脚手架平安技术交底 - 10.31.docx</a><a 
href='http://pan.test.com/show.aspx?type=4&filepath=HNAZ%2f%b0%b2%c8%ab%2f%b0%b2%c8%ab%bd%bb%b5%d7%2f%2f%d2%c6%b6%af%bd%c5%ca%d6%bc%dc%b0%b2%c8%ab%bc%bc%ca%f5%bd%bb%b5%d7+-+10.31.docx' title='查看文件 挪动脚手架平安技术交底 - 10.31.docx ' target=_blank><img src=http://pan.test.com/pics/edit.gif border=0></a> <font color=#999999></font></td><td width=120 align=center>HNAZ</td><td width=120> <div lign=center>45.41 K</div></td><td width=120><div align=center>2021-11-23</div></td></tr> <tr bgcolor=#E8F0FF height=20><td width=25 align=center><img src=http://pan.test.com/filetype/docx.gif></td><td><a href=homeapply.aspx?down=ok&filepath=HNAZ%2f%b0%b2%c8%ab%2f%b0%b2%c8%ab%bd%bb%b5%d7%2f%d3%e0%c8%c8%b9%f8%c2%af%cb%ae%c0%e4%b1%da%b0%b2%c8%ab%bc%bc%ca%f5%bd%bb%b5%d7.docx>余热锅炉水冷壁平安技术交底.docx</a><a href='http://pan.test.com/show.aspx?type=4&filepath=HNAZ%2f%b0%b2%c8%ab%2f%b0%b2%c8%ab%bd%bb%b5%d7%2f%2f%d3%e0%c8%c8%b9%f8%c2%af%cb%ae%c0%e4%b1%da%b0%b2%c8%ab%bc%bc%ca%f5%bd%bb%b5%d7.docx' title='查看文件 余热锅炉水冷壁平安技术交底.docx ' target=_blank><img src=http://pan.test.com/pics/edit.gif border=0></a> <font color=#999999></font></td><td width=120 align=center>HNAZ</td><td width=120> <div lign=center>50.98 K</div></td><td width=120><div align=center>2021-11-23</div></td></tr> </div><tr> <td colspan="5" align=center><div align="center">总共记录数:3276页码:<a href=http://test.com/newshare.aspx?page=10 class=page>...</a> <a href=http://test.com/newshare.aspx?page=11 class=page>11</a> <a href=http://test.com/newshare.aspx?page=12 class=page>12</a> <a href=http://test.com/newshare.aspx?page=13 class=page>13</a> <a href=http://test.com/newshare.aspx?page=14 class=page>14</a> <a href=http://test.com/newshare.aspx?page=15 class=page>15</a> <span class=10>16</span> <a href=http://test.com/newshare.aspx?page=17 class=page>17</a> <a href=http://test.com/newshare.aspx?page=18 class=page>18</a> <a href=http://test.com/newshare.aspx?page=19 class=page>19</a> <a href=http://test.com/newshare.aspx?page=20 class=page>20</a> <a 
href=http://test.com/newshare.aspx?page=21 class=page>...</a><a href=http://test.com/newshare.aspx?page=1>第一页</a><a href=http://test.com/newshare.aspx?page=15>上一页</a><a href=http://test.com/newshare.aspx?page=17>下一页</a><a href=http://test.com/newshare.aspx?page=164>最末页</a>第16页/共164页 </div></td> </tr></table></body></html>'''


def get_lxml(html):
    # Build the tree once, then pull each column out with its own XPath query.
    d = etree.HTML(html)
    link = d.xpath('//tr/td[2]/a[1]/@href')
    title = d.xpath('//tr/td[2]/a[1]/text()')
    passwd = d.xpath('//tr/td[2]/font/text()')
    user = d.xpath('//tr/td[3][@width="120"]/text()')
    size = d.xpath('//tr/td[4]/div/text()')
    # Named mtime rather than time so the time module is not shadowed.
    mtime = d.xpath('//tr/td[5]/div/text()')
    # Stitch the columns into rows; shorter columns are padded with ''.
    datas = list(zip_longest(link, title, passwd, user, size, mtime, fillvalue=''))
    datas = pd.DataFrame(datas, columns=['link', 'title', 'passwd', 'user', 'size', 'time'])
    datas['link'] = datas['link'].str.strip()
    datas.to_csv('result-lxml.csv')
    # print(len(datas))
    return datas


def get_re(html):
    # rep is the module-level pattern string defined under __main__.
    # datas= r.findall(html)
    datas = re.findall(rep, html, re.S)
    # for data in datas:
    #     link, title, passwd, user, time
    #     pass
    # print(len(datas))
    datas = pd.DataFrame(datas, columns=['link', 'title', 'passwd', 'user', 'size', 'time'])
    datas['link'] = datas['link'].str.strip()
    datas.to_csv('result-re.csv')
    return datas


if __name__ == '__main__':
    rep = r"<td><a href=([\s\S]*?)>([\s\S]*?)</a>[\s\S]*?<font color=#999999>([\s\S]*?)</font>[\s\S]*?center>([\s\S]*?)</td><td width=120> <div lign=center>([\s\S]*?)</div></td><td width=120><div align=center>([\s\S]*?)</div>"
    r = re.compile(rep, re.S)
    # Run each parser 500 times on the same in-memory HTML and print the total time.
    for name, function in [('lxml', get_lxml), ('re', get_re)]:
        start = time.time()
        for i in range(500):
            function(ahtml)
        # function(ahtml)
        end = time.time()
        print(name, end - start)
Results
lxml 2.1219968795776367
re 1.341965675354004
re took close to 40% less time than lxml (about 37%, or roughly 1.6x as fast).
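Now test pure parsing speed
The code above also counts the time pandas spends processing the data, so here that part is commented out and the test is run again. The code is as follows: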
def get_lxml(html):
    # Parsing and extraction only; the pandas steps are left out.
    d = etree.HTML(html)
    link = d.xpath('//tr/td[2]/a[1]/@href')
    title = d.xpath('//tr/td[2]/a[1]/text()')
    passwd = d.xpath('//tr/td[2]/font/text()')
    user = d.xpath('//tr/td[3][@width="120"]/text()')
    size = d.xpath('//tr/td[4]/div/text()')
    # Named mtime rather than time so the time module is not shadowed.
    mtime = d.xpath('//tr/td[5]/div/text()')


def get_re(html):
    # datas= r.findall(html)
    datas = re.findall(rep, html, re.S)
Results 2
lxml 0.6280360221862793
re 0.15498137474060059
For parsing and extraction alone, re is about 300% faster than lxml (roughly 4x as fast)!
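For reference, the percentages can be recomputed from the printed timings. The helper `compare` below is a hypothetical name introduced for this sketch; it shows that "times as fast", "percent faster" and "percent less time" are three different numbers for the same pair of measurements, which is why the first run can be described as close to 40% (less time) while the second is described as about 300% (faster).

```python
def compare(slow, fast):
    """Summarise two timings (in seconds) where `fast` is the smaller one."""
    ratio = slow / fast
    return {
        "times_as_fast": ratio,                         # e.g. 4.0 means 4x as fast
        "percent_faster": (ratio - 1) * 100,            # speed gain relative to the slow run
        "percent_less_time": (1 - fast / slow) * 100,   # time saved relative to the slow run
    }


# Timings copied from the two runs in this post (500 iterations each).
with_pandas = compare(2.1219968795776367, 1.341965675354004)
parse_only = compare(0.6280360221862793, 0.15498137474060059)

for label, stats in [("with pandas", with_pandas), ("parse only", parse_only)]:
    print(label, {key: round(value, 2) for key, value in stats.items()})
```

With pandas included, re is roughly 1.6x as fast and takes about 37% less time; for parsing alone it is roughly 4x as fast, which is where the 300% figure comes from.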