Translated from: https://www.scrivano.org/posts/2022-10-21-the-journey-to-spee…
The original author is Red Hat engineer Giuseppe Scrivano, who looks back on the journey of making OCI containers start 30 times faster.
When I started working on crun (https://github.com/containers/crun), I was looking for a way to make containers start and stop faster by improving the OCI runtime, the component in the OCI stack that is ultimately responsible for talking to the kernel and setting up the environment the container runs in.
The OCI runtime itself runs for a very short time; its job is mostly to execute a series of system calls that map directly to the OCI configuration file.
I was very surprised to find that such a trivial task could take so long.
Disclaimer: for my tests I used the default kernel and all the libraries available in a Fedora installation. Besides the fixes described in this blog post, there may have been other fixes over the years that affect the overall performance.
The crun version used for all the tests below is the same.
For all the tests I used hyperfine, installed via cargo.
How things looked in 2017
To see how far we have come, we need to go back to 2017, or just install an old Fedora image. For the tests below I used Fedora 24, based on the Linux 4.5.5 kernel.
On a freshly installed Fedora 24, running crun built from the main branch:
# hyperfine 'crun run foo'
Benchmark 1: 'crun run foo'
Time (mean ± σ): 159.2 ms ± 21.8 ms [User: 43.0 ms, System: 16.3 ms]
Range (min … max): 73.9 ms … 194.9 ms 39 runs
User time and system time refer to the time the process spends in user mode and in kernel mode, respectively.
160 ms is a lot, and as far as I can tell it is similar to what I observed five years ago.
Profiling the OCI runtime immediately showed that most of the user time was spent in libseccomp, compiling the seccomp filter.
To verify this, let's try running a container with the same configuration but without a seccomp profile:
# hyperfine 'crun run foo'
Benchmark 1: 'crun run foo'
Time (mean ± σ): 139.6 ms ± 20.8 ms [User: 4.1 ms, System: 22.0 ms]
Range (min … max): 61.8 ms … 177.0 ms 47 runs
We used only a tenth of the user time needed before (43 ms -> 4.1 ms), and the overall time improved as well!
So there are two distinct problems: 1) system time is quite high, and 2) user time is dominated by libseccomp. We need to address both.
Let's focus on system time for now; we will come back to seccomp later.
System time
Creating and destroying network namespaces
Creating and destroying a network namespace used to be very expensive. The problem can be reproduced with just the unshare tool; on Fedora 24 I got:
# hyperfine 'unshare -n true'
Benchmark 1: 'unshare -n true'
Time (mean ± σ): 47.7 ms ± 51.4 ms [User: 0.6 ms, System: 3.2 ms]
Range (min … max): 0.0 ms … 190.5 ms 365 runs
That is a long time!
I tried to fix it in the kernel and proposed a patch. Florian Westphal rewrote it in a much better way, and his version was merged into the Linux kernel:
commit 8c873e2199700c2de7dbd5eedb9d90d5f109462b
Author: Florian Westphal
Date: Fri Dec 1 00:21:04 2017 +0100
netfilter: core: free hooks with call_rcu
Giuseppe Scrivano says:
"SELinux, if enabled, registers for each new network namespace 6
netfilter hooks."

Cost for this is high. With synchronize_net() removed:

"The net benefit on an SMP machine with two cores is that creating a
new network namespace takes -40% of the original time."
This patch replaces synchronize_net+kvfree with call_rcu().
We store rcu_head at the tail of a structure that has no fixed layout,
i.e. we cannot use offsetof() to compute the start of the original
allocation. Thus store this information right after the rcu head.
We could simplify this by just placing the rcu_head at the start
of struct nf_hook_entries. However, this structure is used in
packet processing hotpath, so only place what is needed for that
at the beginning of the struct.
Reported-by: Giuseppe Scrivano
Signed-off-by: Florian Westphal
Signed-off-by: Pablo Neira Ayuso
commit 26888dfd7e7454686b8d3ea9ba5045d5f236e4d7
Author: Florian Westphal
Date: Fri Dec 1 00:21:03 2017 +0100
netfilter: core: remove synchronize_net call if nfqueue is used
since commit 960632ece6949b ("netfilter: convert hook list to an array")
nfqueue no longer stores a pointer to the hook that caused the packet
to be queued. Therefore no extra synchronize_net() call is needed after
dropping the packets enqueued by the old rule blob.
Signed-off-by: Florian Westphal
Signed-off-by: Pablo Neira Ayuso
commit 4e645b47c4f000a503b9c90163ad905786b9bc1d
Author: Florian Westphal
Date: Fri Dec 1 00:21:02 2017 +0100
netfilter: core: make nf_unregister_net_hooks simple wrapper again
This reverts commit d3ad2c17b4047
("netfilter: core: batch nf_unregister_net_hooks synchronize_net calls").
Nothing wrong with it. However, followup patch will delay freeing of hooks
with call_rcu, so all synchronize_net() calls become obsolete and there
is no need anymore for this batching.
This revert causes a temporary performance degradation when destroying
network namespace, but its resolved with the upcoming call_rcu conversion.
Signed-off-by: Florian Westphal
Signed-off-by: Pablo Neira Ayuso
These patches made a huge difference; the time to create and destroy a network namespace has since dropped to an almost negligible level. Here are the numbers on a modern 5.19.15 kernel:
# hyperfine 'unshare -n true'
Benchmark 1: 'unshare -n true'
Time (mean ± σ): 1.5 ms ± 0.5 ms [User: 0.3 ms, System: 1.3 ms]
Range (min … max): 0.8 ms … 6.7 ms 1907 runs
Mounting mqueue
Mounting mqueue was also a relatively expensive operation.
On Fedora 24 it used to look like this:
# mkdir /tmp/mqueue; hyperfine 'unshare --propagation=private -m mount -t mqueue mqueue /tmp/mqueue'; rmdir /tmp/mqueue
Benchmark 1: 'unshare --propagation=private -m mount -t mqueue mqueue /tmp/mqueue'
Time (mean ± σ): 16.8 ms ± 3.1 ms [User: 2.6 ms, System: 5.0 ms]
Range (min … max): 9.3 ms … 26.8 ms 261 runs
In this case too, I tried to fix it and proposed a patch. It was not accepted, but Al Viro came up with a better version to solve the problem:
commit 36735a6a2b5e042db1af956ce4bcc13f3ff99e21
Author: Al Viro
Date: Mon Dec 25 19:43:35 2017 -0500
mqueue: switch to on-demand creation of internal mount
Instead of doing that upon each ipcns creation, we do that the first
time mq_open(2) or mqueue mount is done in an ipcns. What's more,
doing that allows to get rid of mount_ns() use - we can go with
considerably cheaper mount_nodev(), avoiding the loop over all
mqueue superblock instances; ipcns->mq_mnt is used to locate preexisting
instance in O(1) time instead of O(instances) mount_ns() would've
cost us.
Based upon the version by Giuseppe Scrivano ; I've
added handling of userland mqueue mounts (original had been broken in
that area) and added a switch to mount_nodev().
Signed-off-by: Al Viro
After this patch, the cost of creating an mqueue mount dropped as well:
# mkdir /tmp/mqueue; hyperfine 'unshare --propagation=private -m mount -t mqueue mqueue /tmp/mqueue'; rmdir /tmp/mqueue
Benchmark 1: 'unshare --propagation=private -m mount -t mqueue mqueue /tmp/mqueue'
Time (mean ± σ): 0.7 ms ± 0.5 ms [User: 0.5 ms, System: 0.6 ms]
Range (min … max): 0.0 ms … 3.1 ms 772 runs
Creating and destroying IPC namespaces
I put the work on speeding up container startup aside for a few years and picked it up again in early 2020. Another problem I noticed was the time needed to create and destroy an IPC namespace.
As with network namespaces, the problem can be reproduced with just the unshare tool:
# hyperfine 'unshare -i true'
Benchmark 1: 'unshare -i true'
Time (mean ± σ): 10.9 ms ± 2.1 ms [User: 0.5 ms, System: 1.0 ms]
Range (min … max): 4.2 ms … 17.2 ms 310 runs
Unlike the previous two attempts, this time the patch I sent was accepted upstream:
commit e1eb26fa62d04ec0955432be1aa8722a97cb52e7
Author: Giuseppe Scrivano
Date: Sun Jun 7 21:40:10 2020 -0700
ipc/namespace.c: use a work queue to free_ipc
the reason is to avoid a delay caused by the synchronize_rcu() call in
kern_umount() when the mqueue mount is freed.
the code:
#define _GNU_SOURCE
#include <sched.h>
#include <error.h>
#include <errno.h>
#include <stdlib.h>

int main()
{
	int i;

	for (i = 0; i < 1000; i++)
		if (unshare(CLONE_NEWIPC) < 0)
			error(EXIT_FAILURE, errno, "unshare");
}
goes from
Command being timed: "./ipc-namespace"
User time (seconds): 0.00
System time (seconds): 0.06
Percent of CPU this job got: 0%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:08.05
to
Command being timed: "./ipc-namespace"
User time (seconds): 0.00
System time (seconds): 0.02
Percent of CPU this job got: 96%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:00.03
Signed-off-by: Giuseppe Scrivano
Signed-off-by: Andrew Morton
Reviewed-by: Paul E. McKenney
Reviewed-by: Waiman Long
Cc: Davidlohr Bueso
Cc: Manfred Spraul
Link: http://lkml.kernel.org/r/20200225145419.527994-1-gscrivan@redhat.com
Signed-off-by: Linus Torvalds
With this patch, the time to create and destroy an IPC namespace also dropped dramatically, as outlined in the commit message. On a modern 5.19.15 kernel I now get:
# hyperfine 'unshare -i true'
Benchmark 1: 'unshare -i true'
Time (mean ± σ): 0.1 ms ± 0.2 ms [User: 0.2 ms, System: 0.4 ms]
Range (min … max): 0.0 ms … 1.5 ms 1966 runs
User time
Kernel-side time now seems under control. What can we do to reduce user time?
As we found out earlier, libseccomp is the culprit here, so we need to tackle it; this happened after the IPC fix in the kernel.
Most of the cost of libseccomp comes from the syscall lookup code. The OCI configuration file contains a list of system calls by name; each one is looked up through the seccomp_syscall_resolve_name function, which returns the syscall number for a given syscall name.
libseccomp used to perform a linear search through the syscall table for every syscall name; for x86_64, for example, the table looks like this:
/* NOTE: based on Linux v5.4-rc4 */
const struct arch_syscall_def x86_64_syscall_table[] = { \
	{ "_llseek", __PNR__llseek },
	{ "_newselect", __PNR__newselect },
	{ "_sysctl", 156 },
	{ "accept", 43 },
	{ "accept4", 288 },
	{ "access", 21 },
	{ "acct", 163 },
	.....
};
int x86_64_syscall_resolve_name(const char *name)
{
	unsigned int iter;
	const struct arch_syscall_def *table = x86_64_syscall_table;

	/* XXX - plenty of room for future improvement here */
	for (iter = 0; table[iter].name != NULL; iter++) {
		if (strcmp(name, table[iter].name) == 0)
			return table[iter].num;
	}

	return __NR_SCMP_ERROR;
}
Building the seccomp profile through libseccomp therefore has complexity O(n*m), where n is the number of syscalls in the profile and m is the number of syscalls known to libseccomp.
I followed the suggestion in the code comment and spent some time trying to fix it. In January 2020 I developed a patch for libseccomp that solves the problem by looking up syscall names with a perfect hash function.
This is the libseccomp patch:
commit 9b129c41ac1f43d373742697aa2faf6040b9dfab
Author: Giuseppe Scrivano
Date: Thu Jan 23 17:01:39 2020 +0100
arch: use gperf to generate a perfact hash to lookup syscall names
This patch significantly improves the performance of
seccomp_syscall_resolve_name since it replaces the expensive strcmp
for each syscall in the database, with a lookup table.
The complexity for syscall_resolve_num is not changed and it
uses the linear search, that is anyway less expensive than
seccomp_syscall_resolve_name as it uses an index for comparison
instead of doing a string comparison.
On my machine, calling 1000 seccomp_syscall_resolve_name_arch and
seccomp_syscall_resolve_num_arch over the entire syscalls DB passed
from ~0.45 sec to ~0.06s.
PM: After talking with Giuseppe I made a number of additional
changes, some substantial, the highlights include:
* various style tweaks
* .gitignore fixes
* fixed subject line, tweaked the description
* dropped the arch-syscall-validate changes as they were masking
other problems
* extracted the syscalls.csv and file deletions to other patches
to keep this one more focused
* fixed the x86, x32, arm, all the MIPS ABIs, s390, and s390x ABIs as
the syscall offsets were not properly incorporated into this change
* cleaned up the ABI specific headers
* cleaned up generate_syscalls_perf.sh and renamed to
arch-gperf-generate
* fixed problems with automake's file packaging
Signed-off-by: Giuseppe Scrivano
Reviewed-by: Tom Hromatka
[PM: see notes in the "PM" section above]
Signed-off-by: Paul Moore
The patch has been merged and released; building a seccomp profile now has complexity O(n), where n is the number of syscalls in the profile.
The improvement is significant. With a recent enough libseccomp:
# hyperfine 'crun run foo'
Benchmark 1: 'crun run foo'
Time (mean ± σ): 28.9 ms ± 5.9 ms [User: 16.7 ms, System: 4.5 ms]
Range (min … max): 19.1 ms … 41.6 ms 73 runs
User time is only 16.7 ms. It used to be more than 40 ms, and around 4 ms with seccomp disabled entirely.
So, using 4.1 ms as the user-time cost without seccomp, we have:
time_used_by_seccomp_before = 43.0ms - 4.1ms = 38.9ms
time_used_by_seccomp_after = 16.7ms - 4.1ms = 12.6ms
More than 3 times faster! But syscall lookup is only part of what libseccomp does; another considerable amount of time is spent compiling the BPF filter.
BPF filter compilation
Can we do even better?
BPF filter compilation is done by the seccomp_export_bpf function, and it is still quite expensive.
One simple observation is that most containers reuse the same seccomp profile over and over, with very little customization.
So it makes sense to cache the compiled result and reuse it when possible.
There is a new crun feature to cache the result of the BPF filter compilation. At the time of writing, the patch is not merged yet, although it is almost there.
With it, the cost of compiling the seccomp profile is paid only when the generated BPF filter is not already in the cache, and this is what we have now:
# hyperfine 'crun-from-the-future run foo'
Benchmark 1: 'crun-from-the-future run foo'
Time (mean ± σ): 5.6 ms ± 3.0 ms [User: 1.0 ms, System: 4.5 ms]
Range (min … max): 4.2 ms … 26.8 ms 101 runs
Conclusion
Over more than five years, the total time needed to create and destroy an OCI container has gone from almost 160 ms to a little over 5 ms.
That is almost a 30x improvement!