Translated from: https://www.scrivano.org/posts/2022-10-21-the-journey-to-spee…
The original author is Red Hat engineer Giuseppe Scrivano, who looks back on the journey of making OCI containers start 30 times faster.
When I started working on crun (https://github.com/containers/crun), I was looking for a way to make containers start and stop faster by improving the OCI runtime, the component in the OCI stack that is ultimately responsible for talking to the kernel and setting up the environment the container runs in.
The OCI runtime itself runs for a very short time; its job is mostly to execute a series of system calls that map directly to the OCI configuration file.
I was very surprised to find that such a trivial task could take so long.
Disclaimer: for my tests I used the default kernel and all the libraries available in a Fedora installation. Besides the fixes described in this blog post, there may have been other fixes over the years that affect the overall performance.
The crun version used for all the tests below is the same.
For all the tests I used hyperfine, installed via cargo.
How things looked in 2017
To see how far we have come, we need to go back to 2017, or just install an old Fedora image. For the tests below I used Fedora 24, based on the Linux 4.5.5 kernel.
On a freshly installed Fedora 24, running crun built from the main branch:
# hyperfine 'crun run foo'
Benchmark 1: 'crun run foo'
Time (mean ± σ): 159.2 ms ± 21.8 ms [User: 43.0 ms, System: 16.3 ms]
Range (min … max): 73.9 ms … 194.9 ms 39 runs
User time and system time refer to the time the process spends in user mode and in kernel mode, respectively.
160 ms is a lot, and as far as I can tell it is similar to what I observed five years ago.
Profiling the OCI runtime immediately showed that most of the user time was spent in libseccomp, compiling the seccomp filter.
To verify this, let's try running a container with the same configuration but without a seccomp profile:
# hyperfine 'crun run foo'
Benchmark 1: 'crun run foo'
Time (mean ± σ): 139.6 ms ± 20.8 ms [User: 4.1 ms, System: 22.0 ms]
Range (min … max): 61.8 ms … 177.0 ms 47 runs
We used only a tenth of the user time needed before (43 ms -> 4.1 ms), and the overall time improved as well!
So there are two distinct problems: 1) system time is quite high, and 2) user time is dominated by libseccomp. We need to address both.
Let's focus on system time for now; we will come back to seccomp later.
System time
Creating and destroying network namespaces
Creating and destroying a network namespace used to be very expensive. The problem can be reproduced with just the unshare tool; on Fedora 24 I got:
# hyperfine 'unshare -n true'
Benchmark 1: 'unshare -n true'
Time (mean ± σ): 47.7 ms ± 51.4 ms [User: 0.6 ms, System: 3.2 ms]
Range (min … max): 0.0 ms … 190.5 ms 365 runs
That is a long time!
I tried to fix it in the kernel and proposed a patch. Florian Westphal rewrote it in a much better way, and his version was merged into the Linux kernel:
commit 8c873e2199700c2de7dbd5eedb9d90d5f109462b
Author: Florian Westphal
Date: Fri Dec 1 00:21:04 2017 +0100
netfilter: core: free hooks with call_rcu
Giuseppe Scrivano says:
"SELinux, if enabled, registers for each new network namespace 6
netfilter hooks."

Cost for this is high. With synchronize_net() removed:

"The net benefit on an SMP machine with two cores is that creating a
new network namespace takes -40% of the original time."
This patch replaces synchronize_net+kvfree with call_rcu().
We store rcu_head at the tail of a structure that has no fixed layout,
i.e. we cannot use offsetof() to compute the start of the original
allocation. Thus store this information right after the rcu head.
We could simplify this by just placing the rcu_head at the start
of struct nf_hook_entries. However, this structure is used in
packet processing hotpath, so only place what is needed for that
at the beginning of the struct.
Reported-by: Giuseppe Scrivano
Signed-off-by: Florian Westphal
Signed-off-by: Pablo Neira Ayuso
commit 26888dfd7e7454686b8d3ea9ba5045d5f236e4d7
Author: Florian Westphal
Date: Fri Dec 1 00:21:03 2017 +0100
netfilter: core: remove synchronize_net call if nfqueue is used
since commit 960632ece6949b ("netfilter: convert hook list to an array")
nfqueue no longer stores a pointer to the hook that caused the packet
to be queued. Therefore no extra synchronize_net() call is needed after
dropping the packets enqueued by the old rule blob.
Signed-off-by: Florian Westphal
Signed-off-by: Pablo Neira Ayuso
commit 4e645b47c4f000a503b9c90163ad905786b9bc1d
Author: Florian Westphal
Date: Fri Dec 1 00:21:02 2017 +0100
netfilter: core: make nf_unregister_net_hooks simple wrapper again
This reverts commit d3ad2c17b4047
("netfilter: core: batch nf_unregister_net_hooks synchronize_net calls").
Nothing wrong with it. However, followup patch will delay freeing of hooks
with call_rcu, so all synchronize_net() calls become obsolete and there
is no need anymore for this batching.
This revert causes a temporary performance degradation when destroying
network namespace, but its resolved with the upcoming call_rcu conversion.
Signed-off-by: Florian Westphal
Signed-off-by: Pablo Neira Ayuso
These patches made a huge difference; the time to create and destroy a network namespace has since dropped to an almost negligible level. Here are the numbers on a modern 5.19.15 kernel:
# hyperfine 'unshare -n true'
Benchmark 1: 'unshare -n true'
Time (mean ± σ): 1.5 ms ± 0.5 ms [User: 0.3 ms, System: 1.3 ms]
Range (min … max): 0.8 ms … 6.7 ms 1907 runs
Mounting mqueue
Mounting mqueue was also a relatively expensive operation.
On Fedora 24 it used to look like this:
# mkdir /tmp/mqueue; hyperfine 'unshare --propagation=private -m mount -t mqueue mqueue /tmp/mqueue'; rmdir /tmp/mqueue
Benchmark 1: 'unshare --propagation=private -m mount -t mqueue mqueue /tmp/mqueue'
Time (mean ± σ): 16.8 ms ± 3.1 ms [User: 2.6 ms, System: 5.0 ms]
Range (min … max): 9.3 ms … 26.8 ms 261 runs
In this case too, I tried to fix it and proposed a patch. It was not accepted, but Al Viro came up with a better version to solve the problem:
commit 36735a6a2b5e042db1af956ce4bcc13f3ff99e21
Author: Al Viro
Date: Mon Dec 25 19:43:35 2017 -0500
mqueue: switch to on-demand creation of internal mount
Instead of doing that upon each ipcns creation, we do that the first
time mq_open(2) or mqueue mount is done in an ipcns. What's more,
doing that allows to get rid of mount_ns() use - we can go with
considerably cheaper mount_nodev(), avoiding the loop over all
mqueue superblock instances; ipcns->mq_mnt is used to locate preexisting
instance in O(1) time instead of O(instances) mount_ns() would've
cost us.
Based upon the version by Giuseppe Scrivano ; I've
added handling of userland mqueue mounts (original had been broken in
that area) and added a switch to mount_nodev().
Signed-off-by: Al Viro
After this patch, the cost of creating an mqueue mount dropped as well:
# mkdir /tmp/mqueue; hyperfine 'unshare --propagation=private -m mount -t mqueue mqueue /tmp/mqueue'; rmdir /tmp/mqueue
Benchmark 1: 'unshare --propagation=private -m mount -t mqueue mqueue /tmp/mqueue'
Time (mean ± σ): 0.7 ms ± 0.5 ms [User: 0.5 ms, System: 0.6 ms]
Range (min … max): 0.0 ms … 3.1 ms 772 runs
Creating and destroying IPC namespaces
I put the work on speeding up container startup aside for a few years and picked it up again in early 2020. Another problem I noticed was the time needed to create and destroy an IPC namespace.
As with network namespaces, the problem can be reproduced with just the unshare tool:
# hyperfine 'unshare -i true'
Benchmark 1: 'unshare -i true'
Time (mean ± σ): 10.9 ms ± 2.1 ms [User: 0.5 ms, System: 1.0 ms]
Range (min … max): 4.2 ms … 17.2 ms 310 runs
Unlike the previous two attempts, this time the patch I sent was accepted upstream:
commit e1eb26fa62d04ec0955432be1aa8722a97cb52e7
Author: Giuseppe Scrivano
Date: Sun Jun 7 21:40:10 2020 -0700
ipc/namespace.c: use a work queue to free_ipc
the reason is to avoid a delay caused by the synchronize_rcu() call in
kern_umount() when the mqueue mount is freed.
the code:
#define _GNU_SOURCE
#include <sched.h>
#include <error.h>
#include <errno.h>
#include <stdlib.h>

int main()
{
	int i;

	for (i = 0; i < 1000; i++)
		if (unshare(CLONE_NEWIPC) < 0)
			error(EXIT_FAILURE, errno, "unshare");
}
goes from
Command being timed: "./ipc-namespace"
User time (seconds): 0.00
System time (seconds): 0.06
Percent of CPU this job got: 0%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:08.05
to
Command being timed: "./ipc-namespace"
User time (seconds): 0.00
System time (seconds): 0.02
Percent of CPU this job got: 96%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:00.03
Signed-off-by: Giuseppe Scrivano
Signed-off-by: Andrew Morton
Reviewed-by: Paul E. McKenney
Reviewed-by: Waiman Long
Cc: Davidlohr Bueso
Cc: Manfred Spraul
Link: http://lkml.kernel.org/r/20200225145419.527994-1-gscrivan@redhat.com
Signed-off-by: Linus Torvalds
With this patch, the time to create and destroy an IPC namespace also dropped dramatically, as outlined in the commit message. On a modern 5.19.15 kernel I now get:
# hyperfine 'unshare -i true'
Benchmark 1: 'unshare -i true'
Time (mean ± σ): 0.1 ms ± 0.2 ms [User: 0.2 ms, System: 0.4 ms]
Range (min … max): 0.0 ms … 1.5 ms 1966 runs
User time
Kernel-side time now seems under control. What can we do to reduce user time?
As we found out earlier, libseccomp is the culprit here, so we need to tackle it; this happened after the IPC fix in the kernel.
Most of the cost of libseccomp comes from the syscall lookup code. The OCI configuration file contains a list of system calls by name; each one is looked up through the seccomp_syscall_resolve_name function, which returns the syscall number for a given syscall name.
libseccomp used to perform a linear search through the syscall table for every syscall name; for x86_64, for example, the table looks like this:
/* NOTE: based on Linux v5.4-rc4 */
const struct arch_syscall_def x86_64_syscall_table[] = { \
	{ "_llseek", __PNR__llseek },
	{ "_newselect", __PNR__newselect },
	{ "_sysctl", 156 },
	{ "accept", 43 },
	{ "accept4", 288 },
	{ "access", 21 },
	{ "acct", 163 },
	.....
};
int x86_64_syscall_resolve_name(const char *name)
{
	unsigned int iter;
	const struct arch_syscall_def *table = x86_64_syscall_table;

	/* XXX - plenty of room for future improvement here */
	for (iter = 0; table[iter].name != NULL; iter++) {
		if (strcmp(name, table[iter].name) == 0)
			return table[iter].num;
	}

	return __NR_SCMP_ERROR;
}
Building the seccomp profile through libseccomp therefore has complexity O(n*m), where n is the number of syscalls in the profile and m is the number of syscalls known to libseccomp.
I followed the suggestion in the code comment and spent some time trying to fix it. In January 2020 I developed a patch for libseccomp that solves the problem by looking up syscall names with a perfect hash function.
This is the libseccomp patch:
commit 9b129c41ac1f43d373742697aa2faf6040b9dfab
Author: Giuseppe Scrivano
Date: Thu Jan 23 17:01:39 2020 +0100
arch: use gperf to generate a perfact hash to lookup syscall names
This patch significantly improves the performance of
seccomp_syscall_resolve_name since it replaces the expensive strcmp
for each syscall in the database, with a lookup table.
The complexity for syscall_resolve_num is not changed and it
uses the linear search, that is anyway less expensive than
seccomp_syscall_resolve_name as it uses an index for comparison
instead of doing a string comparison.
On my machine, calling 1000 seccomp_syscall_resolve_name_arch and
seccomp_syscall_resolve_num_arch over the entire syscalls DB passed
from ~0.45 sec to ~0.06s.
PM: After talking with Giuseppe I made a number of additional
changes, some substantial, the highlights include:
* various style tweaks
* .gitignore fixes
* fixed subject line, tweaked the description
* dropped the arch-syscall-validate changes as they were masking
other problems
* extracted the syscalls.csv and file deletions to other patches
to keep this one more focused
* fixed the x86, x32, arm, all the MIPS ABIs, s390, and s390x ABIs as
the syscall offsets were not properly incorporated into this change
* cleaned up the ABI specific headers
* cleaned up generate_syscalls_perf.sh and renamed to
arch-gperf-generate
* fixed problems with automake's file packaging
Signed-off-by: Giuseppe Scrivano
Reviewed-by: Tom Hromatka
[PM: see notes in the "PM" section above]
Signed-off-by: Paul Moore
The patch has been merged and released; building a seccomp profile now has complexity O(n), where n is the number of syscalls in the profile.
The improvement is significant. With a recent enough libseccomp:
# hyperfine 'crun run foo'
Benchmark 1: 'crun run foo'
Time (mean ± σ): 28.9 ms ± 5.9 ms [User: 16.7 ms, System: 4.5 ms]
Range (min … max): 19.1 ms … 41.6 ms 73 runs
User time is only 16.7 ms. It used to be more than 40 ms, and around 4 ms with seccomp disabled entirely.
So, using 4.1 ms as the user-time cost without seccomp, we have:
time_used_by_seccomp_before = 43.0ms - 4.1ms = 38.9ms
time_used_by_seccomp_after = 16.7ms - 4.1ms = 12.6ms
More than 3 times faster! But syscall lookup is only part of what libseccomp does; another considerable amount of time is spent compiling the BPF filter.
BPF filter compilation
Can we do even better?
BPF filter compilation is done by the seccomp_export_bpf function, and it is still quite expensive.
One simple observation is that most containers reuse the same seccomp profile over and over, with very little customization.
So it makes sense to cache the compiled result and reuse it when possible.
There is a new crun feature to cache the result of the BPF filter compilation. At the time of writing, the patch is not merged yet, although it is almost there.
With it, the cost of compiling the seccomp profile is paid only when the generated BPF filter is not already in the cache, and this is what we have now:
# hyperfine 'crun-from-the-future run foo'
Benchmark 1: 'crun-from-the-future run foo'
Time (mean ± σ): 5.6 ms ± 3.0 ms [User: 1.0 ms, System: 4.5 ms]
Range (min … max): 4.2 ms … 26.8 ms 101 runs
Conclusion
Over more than five years, the total time needed to create and destroy an OCI container has gone from almost 160 ms to a little over 5 ms.
That is almost a 30x improvement!