BeautifulSoup User Guide 0x03: Searching the Parse Tree

jiezi

5 years ago

GitHub@orca-j35. All of my notes are hosted in the python_notes repository.
Reposting in any form is welcome, but please credit the source.
Reference: https://www.crummy.com/softwa…

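BeautifulSoup defines many methods for searching the parse tree, and they are all very similar. Most take the same parameters as find_all(): name, attrs, string, limit and **kwargs. Only find() and find_all(), however, support the recursive parameter.

This note concentrates on find() and find_all(); the other search methods behave much like these two.

The examples in this section use the "three sisters" document: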

html_doc = """
<html>
<head>
    <title>The Dormouse's story</title>
</head>
<body>
    <p class="title"><b>The Dormouse's story</b></p>
    <p class="story">Once upon a time there were three little sisters; and their names were
        <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
        <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
        <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
        and they lived at the bottom of a well.
    </p>

    <p class="story">...</p>
"""
from pprint import pprint
from bs4 import BeautifulSoup
import re

soup = BeautifulSoup(html_doc, 'html.parser')

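A filter selects target nodes from the parse tree and is passed as an argument to the search methods.

A string can serve as a filter. BeautifulSoup uses the string to screen nodes and keeps the ones that match:

When a string filters tags, tag nodes whose name equals the string are kept, and HTML text nodes are always discarded.
When a string filters HTML attributes, tag nodes whose attribute value equals the string are kept, and HTML text nodes are always discarded.
When a string filters HTML text, text nodes equal to the string are kept.

A bytes object can be used as a filter in the same way as a str; the difference is that BeautifulSoup assumes the bytes are encoded as UTF-8.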

Example:

soup = BeautifulSoup(html_doc, 'html.parser')
# Find tag nodes named b
print([f"{type(i)}::{i.name}" for i in soup.find_all('b')])
print([f"{type(i)}::{i.name}" for i in soup.find_all(b'b')])
# Find tag nodes whose id is link1
print([f"{type(i)}::{i.name}" for i in soup.find_all(id='link1')])
# Find text nodes whose text is Elsie
print([f"{type(i)}::{i.name}" for i in soup.find_all(text='Elsie')])

Output:

["<class 'bs4.element.Tag'>::b"]
["<class 'bs4.element.Tag'>::b"]
["<class 'bs4.element.Tag'>::a"]
["<class 'bs4.element.NavigableString'>::None"]

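A regular expression object can serve as a filter. BeautifulSoup calls the pattern's search() method to screen nodes and keeps the ones that match:

When a regular expression filters tags, search() is applied to each tag node's name, and the matching tag nodes are kept. Because the .name attribute of a text node is None, HTML text nodes are always discarded.
When a regular expression filters HTML attributes, search() is applied to the value of the given attribute, and the matching tag nodes are kept. Because text nodes have no HTML attributes, HTML text nodes are always discarded.
When a regular expression filters HTML text, search() is applied to each text node, and the matching text nodes are kept.

Example: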

import re

soup = BeautifulSoup(html_doc, 'html.parser')
# Find nodes whose name contains the letter b
print([f"{type(i)}::{i.name}" for i in soup.find_all(re.compile(r'b'))])
# Find tags whose class value starts with t
print([f"{type(i)}::{i.name}" for i in soup.find_all(class_=re.compile(r'^t'))])
# Find text nodes whose text starts with E
print([f"{type(i)}::{i.name}" for i in soup.find_all(text=re.compile(r'^E'))])

Output:

["<class 'bs4.element.Tag'>::body", "<class 'bs4.element.Tag'>::b"]
["<class 'bs4.element.Tag'>::p"]
["<class 'bs4.element.NavigableString'>::None"]
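Because the filter uses search() rather than a full match, an unanchored pattern matches anywhere in the tag name. The sketch below illustrates the difference; the short doc string is made up for illustration and stands in for the "three sisters" document.

```python
import re
from bs4 import BeautifulSoup

# A small made-up document, used only for this illustration.
doc = "<html><body><p><b>bold</b></p></body></html>"
soup = BeautifulSoup(doc, "html.parser")

# search() matches anywhere in the name, so "b" hits both <body> and <b>.
loose = [tag.name for tag in soup.find_all(re.compile(r"b"))]

# Anchoring both ends restricts the match to the tag named exactly "b".
strict = [tag.name for tag in soup.find_all(re.compile(r"^b$"))]

print(loose)   # ['body', 'b']
print(strict)  # ['b']
```

If an exact name is all you need, a plain string filter such as soup.find_all("b") is simpler than an anchored pattern.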

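A list can serve as a filter. Its items can be:

strings
regular expression objects
callables (see the discussion of function filters)

BeautifulSoup uses the items in the list to screen nodes and keeps the ones that match:

When a list filters tags, a tag node is kept if its name matches any item in the list, and HTML text nodes are always discarded.
When a list filters HTML attributes, a tag node is kept if the attribute value matches any item in the list, and HTML text nodes are always discarded.
When a list filters HTML text, a text node is kept if its text matches any item in the list.

Example: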

import re
def func(tag):
    return tag.get('id') == "link1"

soup = BeautifulSoup(html_doc, 'html.parser')
# Find tag nodes that match any item in the list
tag = soup.find_all(['title', re.compile('b$'), func])
pprint([f"{type(i)}::{i.name}" for i in tag])
pprint([f"{type(i)}::{i.name}" for i in soup.find_all(text=["Elsie", "Tillie"])])

Output:

["<class 'bs4.element.Tag'>::title",
 "<class 'bs4.element.Tag'>::b",
 "<class 'bs4.element.Tag'>::a"]
["<class 'bs4.element.NavigableString'>::None",
 "<class 'bs4.element.NavigableString'>::None"]

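The Boolean value True can serve as a filter:

When True filters tags, all tag nodes are kept and all HTML text nodes are discarded.
When True filters an HTML attribute, every tag node that has that attribute is kept and all HTML text nodes are discarded.
When True filters HTML text, all text nodes are kept.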

soup = BeautifulSoup(html_doc, 'html.parser')
pprint([f"{type(i)}::{i.name}" for i in soup.find_all(True)])
pprint([f"{type(i)}::{i.name}" for i in soup.find_all(id=True)])
pprint([f"{type(i)}::{i.name}" for i in soup.find_all(text=True)])

Output:

["<class 'bs4.element.Tag'>::html",
 "<class 'bs4.element.Tag'>::head",
 "<class 'bs4.element.Tag'>::title",
 "<class 'bs4.element.Tag'>::body",
 "<class 'bs4.element.Tag'>::p",
 "<class 'bs4.element.Tag'>::b",
 "<class 'bs4.element.Tag'>::p",
 "<class 'bs4.element.Tag'>::a",
 "<class 'bs4.element.Tag'>::a",
 "<class 'bs4.element.Tag'>::a",
 "<class 'bs4.element.Tag'>::p"]
["<class 'bs4.element.Tag'>::a",
 "<class 'bs4.element.Tag'>::a",
 "<class 'bs4.element.Tag'>::a"]
["<class 'bs4.element.NavigableString'>::None",
 "<class 'bs4.element.NavigableString'>::None",
 "<class 'bs4.element.NavigableString'>::None",
 "<class 'bs4.element.NavigableString'>::None",
 "<class 'bs4.element.NavigableString'>::None",
 "<class 'bs4.element.NavigableString'>::None",
 "<class 'bs4.element.NavigableString'>::None",
 "<class 'bs4.element.NavigableString'>::None",
 "<class 'bs4.element.NavigableString'>::None",
 "<class 'bs4.element.NavigableString'>::None",
 "<class 'bs4.element.NavigableString'>::None",
 "<class 'bs4.element.NavigableString'>::None",
 "<class 'bs4.element.NavigableString'>::None",
 "<class 'bs4.element.NavigableString'>::None",
 "<class 'bs4.element.NavigableString'>::None",
 "<class 'bs4.element.NavigableString'>::None",
 "<class 'bs4.element.NavigableString'>::None",
 "<class 'bs4.element.NavigableString'>::None",
 "<class 'bs4.element.NavigableString'>::None"]

A filter can be a function (or any callable):

When tag nodes are being filtered, the filter function takes a tag node as its argument. If the function returns True the tag node is kept; otherwise it is discarded.

Example: select the tag nodes that have a class attribute but no id attribute:

def has_class_but_no_id(tag):
    # Return True if a tag defines the "class" attribute but doesn't define the "id" attribute
    return tag.has_attr('class') and not tag.has_attr('id')


soup = BeautifulSoup(html_doc, 'html.parser')
tag = soup.find_all(has_class_but_no_id)
pprint([f"{type(i)}::{i.name}" for i in tag])

Output:

["<class 'bs4.element.Tag'>::p",
 "<class 'bs4.element.Tag'>::p",
 "<class 'bs4.element.Tag'>::p"]

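When HTML attributes are being filtered, the filter function takes the attribute value as its argument, not the whole tag node. If the tag node does not have the target attribute, None is passed to the filter function; otherwise the actual value is passed. If the function returns True the tag node is kept; otherwise it is discarded.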

def not_lacie(href):
    # Here’s a function that finds all a tags whose href attribute does not match a regular expression
    return href and not re.compile("lacie").search(href)


soup = BeautifulSoup(html_doc, 'html.parser')
tag = soup.find_all(href=not_lacie)
for i in tag:
    print(f"{type(i)}::{i.name}::{i}")

Output:

<class 'bs4.element.Tag'>::a::<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
<class 'bs4.element.Tag'>::a::<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
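The None guard in not_lacie matters because the filter is also called for tags that lack the attribute. The sketch below records what the function receives; the document string and the seen list are made up for illustration.

```python
from bs4 import BeautifulSoup

# Made-up document: the <p> has no href, the <a> does.
doc = '<p id="intro">hi</p><a href="http://example.com/elsie">Elsie</a>'
soup = BeautifulSoup(doc, "html.parser")

seen = []  # illustration only: every value handed to the filter

def record_href(href):
    # href is None for tags without the attribute, otherwise the attribute value.
    seen.append(href)
    return href is not None

matches = soup.find_all(href=record_href)

print(None in seen)                    # True: the <p> produced a None
print([tag.name for tag in matches])   # ['a']
```

A filter that calls string methods on its argument without checking for None first would raise as soon as it reached the <p> tag.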

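When HTML text is being filtered, the filter function takes the text as its argument, not a whole tag node. If the function returns True the text node is kept; otherwise it is discarded.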

def func(text):
    return text == "Lacie"

soup = BeautifulSoup(html_doc, 'html.parser')
print([f"{type(i)}::{i}" for i in soup.find_all(text=func)])

Output:

["<class 'bs4.element.NavigableString'>::Lacie"]

Filter functions can be as complicated as you need, for example:

html_doc = """<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

from bs4 import NavigableString

def surrounded_by_strings(tag):
    # Return True if a tag is surrounded by string objects
    return (isinstance(tag.next_element, NavigableString)
            and isinstance(tag.previous_element, NavigableString))

soup = BeautifulSoup(html_doc, 'html.parser')
tag = soup.find_all(surrounded_by_strings)
pprint([f"{type(i)}::{i.name}" for i in tag])
# Note how whitespace affects the result: the newlines between tags are string objects too

Output:

["<class 'bs4.element.Tag'>::body",
 "<class 'bs4.element.Tag'>::p",
 "<class 'bs4.element.Tag'>::a",
 "<class 'bs4.element.Tag'>::a",
 "<class 'bs4.element.Tag'>::a",
 "<class 'bs4.element.Tag'>::p"]

find_all(name=None, attrs={}, recursive=True, text=None, limit=None, **kwargs)

This method searches all descendants of the current tag object, collects every node object that matches the given conditions, and returns them in a list.
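A minimal sketch of the return value, using a made-up two-paragraph document: the result behaves like an ordinary list, and a search with no matches yields an empty list rather than None.

```python
from bs4 import BeautifulSoup

# Made-up document for illustration.
soup = BeautifulSoup("<p>one</p><p>two</p>", "html.parser")

found = soup.find_all("p")
missing = soup.find_all("table")

print(isinstance(found, list), len(found))  # True 2
print(found[0].get_text())                  # one
print(missing)                              # []
```

Since an empty result is still a list, it is safe to loop over the return value without checking it first.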

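name is the filter for tag names: find_all() keeps the tag objects that match the name filter. When name is used, HTML text nodes are filtered out automatically, because the .name attribute of a text node is None.

All five kinds of filter described earlier can be used for name: a string, a regular expression, a list, True, or a function (callable).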

soup = BeautifulSoup(html_doc, 'html.parser')
print([f"{type(i)}::{i.name}" for i in soup.find_all('title')])
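#> ["<class 'bs4.element.Tag'>::title"]

Any keyword argument that is not part of the function signature is collected by **kwargs and treated as an HTML attribute filter: find_all() keeps the tag objects whose attribute value matches the keyword argument. When such keyword arguments are used, HTML text nodes are filtered out automatically, because text nodes have no HTML attributes.

All five kinds of filter described earlier can be used as the value of such a keyword argument: a string, a regular expression, a list, True, or a function (callable).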

soup = BeautifulSoup(html_doc, 'html.parser')
# Find tag nodes whose id is link2
print([f"{type(i)}::{i.name}" for i in soup.find_all(id='link2')])
# Find tag nodes whose href ends with the letter 'e'
print([f"{type(i)}::{i.name}" for i in soup.find_all(href=re.compile(r"e$"))])
# Find tag nodes that have an id attribute
print([f"{type(i)}::{i.name}" for i in soup.find_all(id=True)])
# Filter on several HTML attributes at once
print([f"{type(i)}::{i.name}"
    for i in soup.find_all(class_="sister", href=re.compile(r"tillie"))
])

Output:

["<class 'bs4.element.Tag'>::a"]
["<class 'bs4.element.Tag'>::a", "<class 'bs4.element.Tag'>::a", "<class 'bs4.element.Tag'>::a"]
["<class 'bs4.element.Tag'>::a", "<class 'bs4.element.Tag'>::a", "<class 'bs4.element.Tag'>::a"]
["<class 'bs4.element.Tag'>::a"]

The keyword argument string is equivalent to the text argument:

soup = BeautifulSoup(html_doc, 'html.parser')
print([f"{type(i)}::{i}" for i in soup.find_all(string=re.compile("sisters"))])
#> ["<class 'bs4.element.NavigableString'>::Once upon a time there were three little sisters; and their names were\n"]
print([f"{type(i)}::{i}" for i in soup.find_all(text=re.compile("sisters"))])
#> ["<class 'bs4.element.NavigableString'>::Once upon a time there were three little sisters; and their names were\n"]

string was added in Beautiful Soup 4.4.0; earlier versions only accept text.

Some HTML5 attributes, such as the data-* attributes, have names that are not valid Python identifiers and cannot be passed as keyword arguments. Use the attrs parameter to filter on them instead:

data_soup = BeautifulSoup('<div data-foo="value">foo!</div>', 'html.parser')
# The next call is left commented out because it is not valid Python:
# data_soup.find_all(data-foo="value")
#> SyntaxError: keyword can't be an expression (the wording varies with the Python version)

print([f"{type(i)}::{i.name}"
    for i in data_soup.find_all(attrs={"data-foo": "value"})
])
#> ["<class 'bs4.element.Tag'>::div"]

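A keyword argument cannot be used to filter on the HTML name attribute, because find_all() already uses name as the parameter for the tag name. To filter on the name attribute, pass it through attrs instead.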

soup = BeautifulSoup(html_doc, 'html.parser')
name_soup = BeautifulSoup('<input name="email"/>', 'html.parser')
print([f"{type(i)}::{i.name}" for i in name_soup.find_all(name="email")])
print([f"{type(i)}::{i.name}" for i in name_soup.find_all(attrs={"name": "email"})
])

Output:

[]
["<class 'bs4.element.Tag'>::input"]

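class, the name of the CSS attribute, is a reserved word in Python. Since Beautiful Soup 4.1.2 you can filter on the CSS class with the keyword argument class_. As with other keyword arguments, HTML text nodes are filtered out automatically, because text nodes have no HTML attributes.

All five kinds of filter described earlier can be used as the value of class_: a string, a regular expression, a list, True, or a function (callable).

# Find <a> tags whose class is sister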
soup = BeautifulSoup(html_doc, 'html.parser')
pprint([f"{type(i)}::{i.name}" for i in soup.find_all("a", class_="sister")])

# Find tags whose class contains the substring itl
pprint([f"{type(i)}::{i.name}" for i in soup.find_all(class_=re.compile("itl"))])

def has_six_characters(css_class):
    return css_class is not None and len(css_class) == 6
# Find tags whose class value is six characters long
pprint([f"{type(i)}::{i.name}" for i in soup.find_all(class_=has_six_characters)])

pprint([f"{type(i)}::{i.name}" for i in soup.find_all(class_=['title', "story"])])

Output:

["<class 'bs4.element.Tag'>::a",
 "<class 'bs4.element.Tag'>::a",
 "<class 'bs4.element.Tag'>::a"]
["<class 'bs4.element.Tag'>::p"]
["<class 'bs4.element.Tag'>::a",
 "<class 'bs4.element.Tag'>::a",
 "<class 'bs4.element.Tag'>::a"]
["<class 'bs4.element.Tag'>::p",
 "<class 'bs4.element.Tag'>::p",
 "<class 'bs4.element.Tag'>::p"]

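A CSS class attribute may hold several values. If class_ names a single value, every tag that carries that CSS class is selected. If class_ is a string containing several values, it is compared with the class attribute in the exact order written, so the same classes in a different order fail to match: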

css_soup = BeautifulSoup('<p class="body strikeout"></p>', 'html.parser')
print(css_soup.find_all(class_='body'))
#> [<p class="body strikeout"></p>]
print(css_soup.find_all(class_='strikeout'))
#> [<p class="body strikeout"></p>]

print(css_soup.find_all("p", class_="body strikeout"))
#> [<p class="body strikeout"></p>]
print(css_soup.find_all("p", class_="strikeout body"))
#> []

So when you need to search for tags by several CSS classes, use a CSS selector to avoid a failed search caused by a different order:

print(css_soup.select("p.strikeout.body"))
#> [<p class="body strikeout"></p>]
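If you would rather stay with find_all(), one possible alternative is a function filter that compares classes as a set. In the sketch below, has_classes is a hypothetical helper written for this note, not part of Beautiful Soup, and the two-paragraph document is made up.

```python
from bs4 import BeautifulSoup

# Made-up document: only the first <p> carries both classes.
css_soup = BeautifulSoup(
    '<p class="body strikeout"></p><p class="body"></p>', "html.parser")

def has_classes(*wanted):
    """Hypothetical helper: build a tag filter requiring every class in `wanted`, in any order."""
    required = set(wanted)

    def matcher(tag):
        # tag.get("class", []) is a list of class names, or [] when the tag has no class.
        return required.issubset(tag.get("class", []))

    return matcher

hits = css_soup.find_all(has_classes("strikeout", "body"))
print(hits)  # [<p class="body strikeout"></p>]
```

The order of the arguments to has_classes does not matter, because the comparison is done on sets rather than on the joined attribute string.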

Before Beautiful Soup 4.1.2 the class_ parameter was not available; use the attrs parameter for the same search instead:

soup = BeautifulSoup(html_doc, 'html.parser')
pprint([f"{type(i)}::{i.name}" for i in soup.find_all(attrs={"class": "sister"})])

pprint([f"{type(i)}::{i.name}" for i in soup.find_all(attrs="sister")])

Output:

["<class 'bs4.element.Tag'>::a",
 "<class 'bs4.element.Tag'>::a",
 "<class 'bs4.element.Tag'>::a"]
["<class 'bs4.element.Tag'>::a",
 "<class 'bs4.element.Tag'>::a",
 "<class 'bs4.element.Tag'>::a"]

Two kinds of value can be passed to attrs:

A filter: .find_all() then looks for tags whose CSS class matches the filter. All five kinds of filter described earlier can be used.

soup = BeautifulSoup(html_doc, 'html.parser')
print(soup.find_all("p", "title"))
#> [<p class="title"><b>The Dormouse's story</b></p>]

print([f"{type(i)}::{i.name}" for i in soup.find_all(attrs="sister")])
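#> ["<class 'bs4.element.Tag'>::a", "<class 'bs4.element.Tag'>::a", "<class 'bs4.element.Tag'>::a"]

A mapping: .find_all() treats each key-value pair as an HTML attribute name and value, and finds the tags with matching attributes. All five kinds of filter described earlier can be used as values in the mapping.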

soup = BeautifulSoup(html_doc, 'html.parser')

pprint([f"{type(i)}::{i.name}" for i in soup.find_all(attrs={
        "class": "sister",
        "id": "link1",
    })
])
#> ["<class 'bs4.element.Tag'>::a"]

The string argument is new in Beautiful Soup 4.4.0. In earlier versions it was called text.

text is the filter for text nodes: find_all() keeps the text nodes that match the text filter. All five kinds of filter described earlier can be passed as text.

soup = BeautifulSoup(html_doc, 'html.parser')

print(soup.find_all(string="Elsie"))
print(soup.find_all(string=["Tillie", "Elsie", "Lacie"]))
print(soup.find_all(string=re.compile("Dormouse")))


def is_the_only_string_within_a_tag(s):
    """Return True if this string is the only child of its parent tag."""
    return (s == s.parent.string)


print(soup.find_all(string=is_the_only_string_within_a_tag))

Output:

['Elsie']
['Elsie', 'Lacie', 'Tillie']
["The Dormouse's story", "The Dormouse's story"]
["The Dormouse's story", "The Dormouse's story", 'Elsie', 'Lacie', 'Tillie', '...']

When searching for tags, text acts as an additional condition: find_all() selects the tags whose .string matches the text filter:

soup = BeautifulSoup(html_doc, 'html.parser')

print([f'{type(i)}::{i}' for i in soup.find_all("a", string="Elsie")])
#> ['<class \'bs4.element.Tag\'>::<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>']

By default find_all() returns every match. If you don't need them all, use the limit parameter to cap the number of results; BeautifulSoup stops searching once it has found limit objects.

soup = BeautifulSoup(html_doc, 'html.parser')
# There are three links in the "three sisters" document,
# but this code only finds the first two
print([f'{type(i)}::{i.name}' for i in soup.find_all("a", limit=2)])
#> ["<class 'bs4.element.Tag'>::a", "<class 'bs4.element.Tag'>::a"]

By default find_all() searches all descendants of the current tag object. To avoid recursing through every descendant, pass recursive=False, and only the direct children are searched:

soup = BeautifulSoup(html_doc, 'html.parser')

print([f'{type(i)}::{i.name}' for i in soup.find_all("title")])
#> ["<class 'bs4.element.Tag'>::title"]
print([f'{type(i)}::{i.name}' for i in soup.find_all("title", recursive=False)])
#> []
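The empty result above follows from the tree shape: the only direct child of the soup object is <html>. A small sketch with a made-up document shows which tags recursive=False can reach from two different starting points.

```python
from bs4 import BeautifulSoup

# Made-up document for illustration.
doc = "<html><head><title>T</title></head><body><p>x</p></body></html>"
soup = BeautifulSoup(doc, "html.parser")

# <html> is the only direct child of the soup object.
top_level = [tag.name for tag in soup.find_all(True, recursive=False)]

# <head> and <body> are the direct children of <html>; <title> and <p> sit one level deeper.
html_children = [tag.name for tag in soup.html.find_all(True, recursive=False)]

print(top_level)      # ['html']
print(html_children)  # ['head', 'body']
```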

Calling a `Tag` object

find_all() is the most frequently used search method in BeautifulSoup, so the developers gave it a shortcut: calling a Tag object is the same as calling its find_all() method. The source code is:

def __call__(self, *args, **kwargs):
    """Calling a tag like a function is the same as calling its
        find_all() method. Eg. tag('a') returns a list of all the A tags
        found within this tag."""
    return self.find_all(*args, **kwargs)

Example:

soup("a") # equivalent to soup.find_all("a")
soup.title(string=True) # equivalent to soup.title.find_all(string=True)
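A quick check of that equivalence, using a made-up fragment: the call syntax forwards positional and keyword arguments unchanged, so limit and the other parameters work the same way.

```python
from bs4 import BeautifulSoup

# Made-up fragment for illustration.
soup = BeautifulSoup('<p><a id="x">1</a><a id="y">2</a></p>', "html.parser")

via_call = soup("a")
via_method = soup.find_all("a")
first_only = [a["id"] for a in soup.p("a", limit=1)]

print(via_call == via_method)  # True
print(first_only)              # ['x']
```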

find(name, attrs, recursive, string, **kwargs)

find() returns only the first matching object, or None if nothing matches. Navigating the parse tree by tag name is in effect a call to find().
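The sketch below, on a made-up fragment, compares find() with find_all(limit=1) and with attribute-style navigation, and shows one way to guard against the None that find() returns when nothing matches.

```python
from bs4 import BeautifulSoup

# Made-up fragment for illustration.
soup = BeautifulSoup("<head><title>The Dormouse's story</title></head>", "html.parser")

first = soup.find("title")
same_as_limit = first is soup.find_all("title", limit=1)[0]
same_as_attribute = first is soup.title

# find() returns None on a miss, so check before calling methods on the result.
table = soup.find("table")
caption = table.get_text() if table is not None else ""

print(same_as_limit, same_as_attribute)  # True True
print(table)                             # None
```

Unlike find_all(), which returns an empty list on a miss, chaining a method call directly onto a failed find() raises AttributeError, which is why the guard is worth the extra line.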

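To follow the methods below, cross-reference the "Navigating the parse tree" section of the note ﹝BeautifulSoup – 解析树.md﹞ for the structure of the parse tree.

This section does not explain each method in detail; it only gives the function signatures and links to the documentation.

find_parents(name, attrs, string, limit, **kwargs)

find_parent(name, attrs, string, **kwargs)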
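As a brief illustration of these two signatures, the sketch below walks upward from a link in a made-up document. It assumes only the filter behavior described earlier for find() and find_all().

```python
from bs4 import BeautifulSoup

# Made-up document for illustration.
doc = '<html><body><p class="story"><a id="link1">Elsie</a></p></body></html>'
soup = BeautifulSoup(doc, "html.parser")
link = soup.find("a")

# find_parent() returns the nearest matching ancestor, or None.
nearest_p = link.find_parent("p")
no_table = link.find_parent("table")

# find_parents() returns every matching ancestor, nearest first.
ancestors = [tag.name for tag in link.find_parents(["p", "body"])]

print(nearest_p["class"])  # ['story']
print(no_table)            # None
print(ancestors)           # ['p', 'body']
```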