In Python web scraping, the BeautifulSoup library is an essential tool for parsing web pages. In the introductory tutorial we covered the basics of BeautifulSoup; in this article we will look at its more advanced usage.
1. Complex Search Conditions

When looking up elements with the find and find_all methods, we can combine several search conditions. For example, we can find all p tags whose class is "story":
from bs4 import BeautifulSoup
html_doc = """<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were</p>
"""soup = BeautifulSoup(html_doc,'html.parser')
story_p_tags = soup.find_all('p', class_='story')
for p in story_p_tags:
print(p.string)
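find and find_all accept more than keyword filters: regular expressions, attribute dictionaries, and even functions can be used to select tags. The following is a minimal sketch reusing the soup object built above; which tags it prints depends on the sample document.

import re

# Match any tag whose name starts with "b" (here <body> and <b>)
for tag in soup.find_all(re.compile('^b')):
    print(tag.name)

# Equivalent to class_='story', written with an attrs dictionary
print(soup.find_all('p', attrs={'class': 'story'}))

# A function filter: tags that have a class attribute but no id attribute
print(soup.find_all(lambda tag: tag.has_attr('class') and not tag.has_attr('id')))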
2. Traversing the DOM Tree

BeautifulSoup makes it easy to walk the DOM tree. Some commonly used traversal methods are shown below:
from bs4 import BeautifulSoup
html_doc = """<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were</p>
"""soup = BeautifulSoup(html_doc,'html.parser')
# 获取间接子节点
for child in soup.body.children:
print(child)
# 获取所有子孙节点
for descendant in soup.body.descendants:
print(descendant)
# 获取兄弟节点
for sibling in soup.p.next_siblings:
print(sibling)
# 获取父节点
print(soup.p.parent)
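Traversal also works upward and across text nodes. As a small sketch based on the same soup object, .parents walks every ancestor of a tag, and .stripped_strings yields each piece of text with surrounding whitespace removed:

# Walk every ancestor of the <b> tag, from its <p> up to the document itself
for parent in soup.b.parents:
    print(parent.name)

# All text fragments in the document, with whitespace stripped
for text in soup.stripped_strings:
    print(text)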
3. Modifying the DOM Tree

Besides traversing the DOM tree, we can also modify it. For example, we can change a tag's content and attributes:
from bs4 import BeautifulSoup
html_doc = """<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were</p>
"""soup = BeautifulSoup(html_doc,'html.parser')
soup.p.string = 'New story'
soup.p['class'] = 'new_title'
print(soup.p)
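Modification is not limited to editing existing tags in place. As a rough sketch (the URL and link text below are only illustrative), soup.new_tag creates a new element, append attaches it to the tree, and decompose removes a tag and its contents entirely:

# Create a new <a> tag and append it to the first <p>
new_link = soup.new_tag('a', href='https://example.com')
new_link.string = 'A new link'
soup.p.append(new_link)

# Remove the "story" paragraph from the tree entirely
story_p = soup.find('p', class_='story')
if story_p is not None:
    story_p.decompose()

print(soup.prettify())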
4. Parsing XML

Besides HTML, BeautifulSoup can also parse XML. We only need to specify "lxml-xml" as the parser when creating the BeautifulSoup object (this requires the lxml package to be installed):
from bs4 import BeautifulSoup
xml_doc = """
<bookstore>
<book category="COOKING">
<title lang="en">Everyday Italian</title>
<author>Giada De Laurentiis</author>
<year>2005</year>
</book>
</bookstore>
"""soup = BeautifulSoup(xml_doc,'lxml-xml')
print(soup.prettify())
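Once the XML is parsed, it can be navigated with the same find/find_all API used for HTML. A minimal sketch against the bookstore document above:

# Pick individual elements and attributes out of the XML tree
book = soup.find('book')
print(book['category'])            # COOKING
print(book.find('title').string)   # Everyday Italian
print(book.find('title')['lang'])  # en
print(book.find('year').string)    # 2005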
These are the more advanced features of the BeautifulSoup library. With them, we can parse web pages more precisely and build more effective web crawlers.