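K8S Internals Series: Part 1

After the container orchestration wars ended with Kubernetes as the clear winner, K8S became the operating system of the cloud-native era. K8S has made everything simpler, yet it has itself grown steadily more complex. The K8S Internals series covers many aspects of the K8S ecosystem: the 博云 (BoCloud) container cloud R&D team will regularly share articles on hot topics such as scheduling, security, networking, performance, storage, and application scenarios. We hope that, while enjoying the efficiency and convenience K8S brings, readers can also come to appreciate how its internals work, taking them apart with a practiced hand.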

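1. Introduction to Pod Security Policy

Pod Security Admission is intended to replace Pod Security Policy, so it is worth introducing Pod Security Policy first. A Pod Security Policy defines a set of conditions that a Pod must satisfy at runtime, along with default values for the related fields; a Pod is created only if it meets these conditions. The Spec of a Pod Security Policy object contains the following fields, which are the aspects a Pod Security Policy can control: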

Control aspect	Field names
Running of privileged containers	privileged
Usage of host namespaces	hostPID, hostIPC
Usage of host networking and ports	hostNetwork, hostPorts
Usage of volume types	volumes
Usage of the host filesystem	allowedHostPaths
Allowing specific FlexVolume drivers	allowedFlexVolumes
Allocating an FSGroup that owns the Pod's volumes	fsGroup
Requiring a read-only root filesystem	readOnlyRootFilesystem
User and group IDs of the container	runAsUser, runAsGroup, supplementalGroups
Restricting escalation to root privileges	allowPrivilegeEscalation, defaultAllowPrivilegeEscalation
Linux capabilities	defaultAddCapabilities, requiredDropCapabilities, allowedCapabilities
SELinux context of the container	seLinux
Allowed proc mount types for the container	allowedProcMountTypes
AppArmor profile used by containers	annotations
seccomp profile used by containers	annotations
sysctl profile used by containers	forbiddenSysctls, allowedUnsafeSysctls

AppArmor and seccomp are configured by adding annotations to the PodSecurityPolicy object:

seccomp.security.alpha.kubernetes.io/allowedProfileNames: 'docker/default'
seccomp.security.alpha.kubernetes.io/defaultProfileName: 'docker/default'
apparmor.security.beta.kubernetes.io/allowedProfileNames: 'runtime/default'
apparmor.security.beta.kubernetes.io/defaultProfileName: 'runtime/default'

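Pod Security Policy is a cluster-level resource. Its usage flow is shown below:

[Figure: PSP usage flow]

Using a PSP requires creating a ClusterRole/Role and a ClusterRoleBinding/RoleBinding that bind it to a service account. As a result, it is hard to tell which PSPs are actually in use, and harder still to see which security rules restrict the creation of a given Pod.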

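2. Why Pod Security Admission Was Introduced

Anyone who has used PodSecurityPolicy will have run into its problems: it has no dry-run or audit mode, it is inconvenient to switch on and off, and its behavior is hard to reason about. Because of these shortcomings, PodSecurityPolicy was marked as deprecated in Kubernetes v1.21 and is scheduled for removal in v1.25, while Kubernetes v1.22 added the new Pod Security Admission feature.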

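3. Introduction to Pod Security Admission

Pod Security Admission is an admission controller built into Kubernetes. Its feature gate is enabled by default in Kubernetes v1.23; in v1.22 it has to be enabled with the kube-apiserver flag --feature-gates="...,PodSecurity=true". On Kubernetes versions older than v1.22, you can install the Pod Security Admission Webhook yourself.

Pod Security Admission restricts the creation of Pods in a cluster by enforcing the built-in Pod Security Standards.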

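3.1 Pod Security Standards

To cover a broad range of security scenarios, the Pod Security Standards define three progressively stricter Pod security policies:

Profile	Description
Privileged	Unrestricted policy that provides the widest possible level of permissions. This policy allows known privilege escalations.
Baseline	Minimally restrictive policy that prevents known privilege escalations. It allows the default (minimally specified) Pod configuration.
Restricted	Heavily restricted policy that follows current Pod hardening best practices.

For details, see the Pod Security Standards documentation.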

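3.2 Applying Pod Security Standards

Once the Pod Security Admission feature gate is enabled in a cluster, the Pod Security Standards are applied by setting labels on a namespace. Three modes are available:

Mode	Description
enforce	Pods that violate the policy are rejected.
audit	Policy violations add an audit annotation to the event recorded in the audit log, but are otherwise allowed.
warn	Policy violations trigger a user-facing warning, but are otherwise allowed.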
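The difference between the three modes can be sketched in Go as follows. This is only an illustration of the table above, not the controller's code: the Decision type, the applyModes function, and the violates callback are hypothetical names introduced for this example, and the note strings are invented.

```go
package main

import "fmt"

// Decision is a hypothetical summary of what happens to one Pod request.
type Decision struct {
	Allowed   bool   // false only when the enforce level is violated
	AuditNote string // would be attached to the audit log event
	Warning   string // would be returned to the user
}

// applyModes evaluates a Pod against the level configured for each mode.
// levels maps a mode ("enforce", "audit", "warn") to a level name;
// violates reports whether the Pod violates the given level.
func applyModes(levels map[string]string, violates func(level string) bool) Decision {
	d := Decision{Allowed: true}
	if lvl, ok := levels["enforce"]; ok && violates(lvl) {
		d.Allowed = false // only enforce rejects the request
	}
	if lvl, ok := levels["audit"]; ok && violates(lvl) {
		d.AuditNote = "would violate " + lvl
	}
	if lvl, ok := levels["warn"]; ok && violates(lvl) {
		d.Warning = "would violate " + lvl
	}
	return d
}

func main() {
	// Assume a Pod that passes baseline but fails restricted.
	violates := func(level string) bool { return level == "restricted" }
	levels := map[string]string{"enforce": "baseline", "warn": "restricted"}

	d := applyModes(levels, violates)
	fmt.Printf("allowed=%v warning=%q audit=%q\n", d.Allowed, d.Warning, d.AuditNote)
}
```

With these inputs the Pod is admitted, because only the warn level is violated. A namespace can therefore enforce a loose level while previewing the effect of a stricter one.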

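Label template:

# Set the mode and the Pod Security Standards level.
# MODE must be one of `enforce`, `audit`, or `warn`.
# LEVEL must be one of `privileged`, `baseline`, or `restricted`.
pod-security.kubernetes.io/<MODE>: <LEVEL>

# Optional: pins the policy to the version of the standards that shipped with a given Kubernetes minor version.
# MODE must be one of `enforce`, `audit`, or `warn`.
# VERSION must be a valid Kubernetes minor version (for example v1.23), or `latest`.
pod-security.kubernetes.io/<MODE>-version: <VERSION>

A namespace can configure any number of the modes, and different modes can be set to different policy levels.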
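To make the label scheme concrete, here is a minimal Go sketch that reads these labels from a namespace's label map. It is not the admission controller's parser: ModePolicy and parsePolicy are hypothetical names, the version check is deliberately simplistic, and falling back to "latest" when no version label is present is an assumption of this sketch.

```go
package main

import (
	"fmt"
	"strings"
)

const labelPrefix = "pod-security.kubernetes.io/"

// ModePolicy is a hypothetical holder for one mode's level and pinned version.
type ModePolicy struct {
	Level   string
	Version string
}

// parsePolicy reads the pod-security labels of a namespace.
// Modes without a label are absent from the result; a missing
// <MODE>-version label falls back to "latest" (an assumption of this sketch).
func parsePolicy(labels map[string]string) (map[string]ModePolicy, error) {
	validLevels := map[string]bool{"privileged": true, "baseline": true, "restricted": true}
	out := map[string]ModePolicy{}
	for _, mode := range []string{"enforce", "audit", "warn"} {
		level, ok := labels[labelPrefix+mode]
		if !ok {
			continue
		}
		if !validLevels[level] {
			return nil, fmt.Errorf("mode %s: invalid level %q", mode, level)
		}
		version, ok := labels[labelPrefix+mode+"-version"]
		if !ok {
			version = "latest"
		}
		// Simplified check: accept "latest" or anything shaped like "v1.<minor>".
		if version != "latest" && !strings.HasPrefix(version, "v1.") {
			return nil, fmt.Errorf("mode %s: invalid version %q", mode, version)
		}
		out[mode] = ModePolicy{Level: level, Version: version}
	}
	return out, nil
}

func main() {
	policy, err := parsePolicy(map[string]string{
		labelPrefix + "enforce":         "baseline",
		labelPrefix + "enforce-version": "v1.23",
		labelPrefix + "warn":            "restricted",
	})
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	for _, mode := range []string{"enforce", "audit", "warn"} {
		if p, ok := policy[mode]; ok {
			fmt.Printf("%s: %s@%s\n", mode, p.Level, p.Version)
		}
	}
}
```

The example namespace enforces Baseline pinned to v1.23 and warns on Restricted without a pinned version, so each mode carries its own level.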

Defaults for Pod Security Admission can be set through the admission controller configuration file:

apiVersion: apiserver.config.k8s.io/v1
kind: AdmissionConfiguration
plugins:
- name: PodSecurity
  configuration:
    apiVersion: pod-security.admission.config.k8s.io/v1beta1
    kind: PodSecurityConfiguration
    # Defaults applied when a mode label is not set.
    #
    # Level label values must be one of:
    # - "privileged" (default)
    # - "baseline"
    # - "restricted"
    #
    # Version label values must be one of:
    # - "latest" (default)
    # - specific version like "v1.23"
    defaults:
      enforce: "privileged"
      enforce-version: "latest"
      audit: "privileged"
      audit-version: "latest"
      warn: "privileged"
      warn-version: "latest"
    exemptions:
      # Array of authenticated usernames to exempt.
      usernames: []
      # Array of runtime class names to exempt.
      runtimeClassNames: []
      # Array of namespaces to exempt.
      namespaces: []

Pod Security Admission can exempt Pods from the security checks along three dimensions: username, runtimeClassName, and namespace.
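The three exemption dimensions can be pictured with a small Go sketch. The Exemptions type and its IsExempt method are hypothetical and only mirror the three lists of the exemptions block in the configuration above; the usernames and namespaces in main are made-up examples.

```go
package main

import "fmt"

// Exemptions mirrors the three lists in the configuration's exemptions block.
type Exemptions struct {
	Usernames         []string
	RuntimeClassNames []string
	Namespaces        []string
}

// contains reports whether s is present in list.
func contains(list []string, s string) bool {
	for _, v := range list {
		if v == s {
			return true
		}
	}
	return false
}

// IsExempt reports whether a request is exempt on any of the three dimensions.
// An empty runtimeClassName means the Pod does not set one.
func (e Exemptions) IsExempt(username, namespace, runtimeClassName string) bool {
	if contains(e.Usernames, username) || contains(e.Namespaces, namespace) {
		return true
	}
	return runtimeClassName != "" && contains(e.RuntimeClassNames, runtimeClassName)
}

func main() {
	e := Exemptions{
		Usernames:  []string{"example-admin"}, // hypothetical username
		Namespaces: []string{"kube-system"},
	}
	fmt.Println(e.IsExempt("alice", "kube-system", "")) // exempt by namespace
	fmt.Println(e.IsExempt("alice", "default", ""))     // not exempt
}
```

A match on any single dimension is enough in this sketch; the request then skips the policy evaluation entirely.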

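3.3 Pod Security Standards in Practice

Environment: Kubernetes v1.23

Running containers face many attack risks, such as container escape and resource-exhaustion attacks launched from inside a container.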

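3.3.1 Baseline Policy

The Baseline policy targets common containerized applications and prevents known privilege escalations. According to the official documentation, it is aimed at application operators and developers of non-critical applications. The policy covers:

Disallowing shared host namespaces and privileged containers, restricting Linux capabilities, disallowing hostPath volumes, restricting host ports, and constraining AppArmor, SELinux, seccomp, and sysctl settings.

The following demonstrates how to apply the Baseline policy.

Risks of violating the Baseline policy:

A privileged container can see the host's devices.
With procfs mounted, a container can see host processes, which breaks process isolation.
Network isolation can be broken.
With the runtime socket mounted, a container can talk to the container runtime without restriction.

Any of these risks, among others, can lead to container escape.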

Create a namespace named my-baseline-namespace, and set both the enforce and warn modes to the Baseline level:

apiVersion: v1
kind: Namespace
metadata:
  name: my-baseline-namespace
  labels:
    pod-security.kubernetes.io/enforce: baseline
    pod-security.kubernetes.io/enforce-version: v1.23
    pod-security.kubernetes.io/warn: baseline
    pod-security.kubernetes.io/warn-version: v1.23

Creating Pods

Create a Pod that violates the Baseline policy:

apiVersion: v1
kind: Pod
metadata:
  name: hostnamespaces2
  namespace: my-baseline-namespace
spec:
  containers:
  - image: bitnami/prometheus:2.33.5
    name: prometheus
    securityContext:
      allowPrivilegeEscalation: true
      privileged: true
      capabilities:
        drop:
        - ALL
  hostPID: true
  securityContext:
    runAsNonRoot: true
    seccompProfile:
      type: RuntimeDefault

Running kubectl apply reports that hostPID=true and securityContext.privileged=true are not allowed, and the Pod is rejected. A privileged container running with hostPID enabled is not isolated from the host's processes, which makes container escape easy:

[root@localhost podSecurityStandard]# kubectl apply -f fail-hostnamespaces2.yaml
Error from server (Forbidden): error when creating "fail-hostnamespaces2.yaml": pods "hostnamespaces2" is forbidden: violates PodSecurity "baseline:v1.23": host namespaces (hostPID=true), privileged (container "prometheus" must not set securityContext.privileged=true)

Create a Pod that complies with the Baseline policy by setting hostPID=false and securityContext.privileged=false:

apiVersion: v1
kind: Pod
metadata:
  name: hostnamespaces2
  namespace: my-baseline-namespace
spec:
  containers:
  - image: bitnami/prometheus:2.33.5
    name: prometheus
    securityContext:
      allowPrivilegeEscalation: false
      privileged: false
      capabilities:
        drop:
        - ALL
  hostPID: false
  securityContext:
    runAsNonRoot: true
    seccompProfile:
      type: RuntimeDefault

Running kubectl apply now succeeds and the Pod is created:

[root@localhost podSecurityStandard]# kubectl apply -f pass-hostnamespaces2.yaml
pod/hostnamespaces2 created

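3.3.2 Restricted Policy

The Restricted policy aims to enforce current Pod hardening best practices. According to the official documentation, it mainly targets operators and developers of security-critical applications, as well as lower-trust users. It includes everything in the Baseline policy and additionally: restricts non-core volume types to those defined through PersistentVolumes, forbids privilege escalation (through SetUID or SetGID file modes), requires containers to run as a non-root user, forbids containers from setting runAsUser to 0, and requires containers to drop ALL capabilities while only permitting the NET_BIND_SERVICE capability to be added back.

The Restricted policy further limits gaining root privileges inside a container and the use of Linux capabilities. For example, a man-in-the-middle attack on the Kubernetes network needs the Linux CAP_NET_RAW capability in order to send ARP packets.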
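The capability rule described above (drop ALL, add back at most NET_BIND_SERVICE) can be sketched as a small Go check. This is an illustration under simplified assumptions, not the upstream implementation: the Capabilities struct is a stand-in for a container's securityContext.capabilities field, and checkRestrictedCapabilities and its messages are hypothetical.

```go
package main

import "fmt"

// Capabilities is a simplified stand-in for a container's
// securityContext.capabilities field.
type Capabilities struct {
	Add  []string
	Drop []string
}

// checkRestrictedCapabilities returns the reasons a container's capabilities
// break the Restricted rule; an empty result means the container passes.
// A nil argument models a container that sets no capabilities at all.
func checkRestrictedCapabilities(c *Capabilities) []string {
	var reasons []string
	droppedAll := false
	if c != nil {
		for _, d := range c.Drop {
			if d == "ALL" {
				droppedAll = true
			}
		}
	}
	if !droppedAll {
		reasons = append(reasons, `must set capabilities.drop=["ALL"]`)
	}
	if c != nil {
		for _, a := range c.Add {
			if a != "NET_BIND_SERVICE" {
				reasons = append(reasons, "must not add capability "+a)
			}
		}
	}
	return reasons
}

func main() {
	// NET_RAW is the capability needed to send raw ARP packets.
	risky := &Capabilities{Drop: []string{"ALL"}, Add: []string{"NET_RAW"}}
	safe := &Capabilities{Drop: []string{"ALL"}, Add: []string{"NET_BIND_SERVICE"}}

	fmt.Println(checkRestrictedCapabilities(nil))
	fmt.Println(checkRestrictedCapabilities(risky))
	fmt.Println(checkRestrictedCapabilities(safe))
}
```

Under this rule a container cannot keep or regain NET_RAW, which is what removes the ARP-based attack path mentioned above.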

Create a namespace named my-restricted-namespace, and set both the enforce and warn modes to the Restricted level:

apiVersion: v1
kind: Namespace
metadata:
  name: my-restricted-namespace
  labels:
    pod-security.kubernetes.io/enforce: restricted
    pod-security.kubernetes.io/enforce-version: v1.23
    pod-security.kubernetes.io/warn: restricted
    pod-security.kubernetes.io/warn-version: v1.23

Creating Pods

Create a Pod that violates the Restricted policy:

apiVersion: v1
kind: Pod
metadata:
  name: runasnonroot0
  namespace: my-restricted-namespace
spec:
  containers:
  - image: bitnami/prometheus:2.33.5
    name: prometheus
    securityContext:
      allowPrivilegeEscalation: false
  securityContext:
    seccompProfile:
      type: RuntimeDefault

Running kubectl apply reports that securityContext.runAsNonRoot=true and securityContext.capabilities.drop=["ALL"] must be set, and the Pod is rejected. A container running as root holds excessive privileges, and when its Linux capabilities have not been dropped as well, it can be used for a man-in-the-middle attack on the Kubernetes network:

[root@localhost podSecurityStandard]# kubectl apply -f fail-runasnonroot0.yaml
Error from server (Forbidden): error when creating "fail-runasnonroot0.yaml": pods "runasnonroot0" is forbidden: violates PodSecurity "restricted:v1.23": unrestricted capabilities (container "prometheus" must set securityContext.capabilities.drop=["ALL"]), runAsNonRoot != true (pod or container "prometheus" must set securityContext.runAsNonRoot=true)

Create a Pod that complies with the Restricted policy by setting securityContext.runAsNonRoot=true and dropping all Linux capabilities:

apiVersion: v1
kind: Pod
metadata:
  name: runasnonroot0
  namespace: my-restricted-namespace
spec:
  containers:
  - image: bitnami/prometheus:2.33.5
    name: prometheus
    securityContext:
      allowPrivilegeEscalation: false
      capabilities:
        drop:
        - ALL
  securityContext:
    runAsNonRoot: true
    seccompProfile:
      type: RuntimeDefault

Running kubectl apply now succeeds and the Pod is created:

[root@localhost podSecurityStandard]# kubectl apply -f pass-runasnonroot0.yaml
pod/runasnonroot0 created

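3.4 Current Limitations of Pod Security Admission

If your cluster already uses PodSecurityPolicy, migrating those policies to Pod Security Admission takes a certain amount of work.

First consider whether Pod Security Admission in its current form suits your cluster. It is designed to meet the most common security needs out of the box, and it differs from PSP in the following ways:

Pod Security Admission only checks Pods against the security standards. It does not mutate Pods and cannot set default security settings for them.
Pod Security Admission supports only the three officially defined policy levels and does not allow custom policies. PSP rules therefore cannot always be migrated one to one, and each security rule needs to be reviewed individually.
Unlike PSP, Pod Security Admission cannot be bound to specific users. It only supports exempting specific users, RuntimeClasses, or namespaces.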

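4. Pod Security Admission Source Code Analysis

A Kubernetes admission controller is a plugin that is decoupled from the API server logic at the code level. When an object is created, updated, or deleted, the controller can intercept the request and run specific logic before the object is persisted to etcd. The typical path of a request through the API server is shown below:

[Figure: API request processing flow]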

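4.1 Main Logic Flow

[Figure: Pod Security Admission code flow]

The main logic of Pod Security Admission is shown in the figure. The admission controller first parses the intercepted request and then handles it according to the resource type:

Namespace: if the resource is a Namespace, the admission controller first parses the namespace's labels to obtain the configured policy level, mode, and pinned policy version. If the labels carry no Pod Security Standards information, the request is allowed immediately. Otherwise the controller distinguishes between creating a new namespace and updating an existing one: on create it checks that the configuration is valid, and on update it evaluates whether the Pods in the namespace comply with the newly configured policy.
Pod: if the resource is a Pod, the admission controller first looks up the policy configured on the Pod's namespace. If the namespace has no policy configured, the request is allowed; otherwise the Pod is evaluated against the policy.
Others: the admission controller first looks up the policy configured on the resource's namespace. If the namespace has no policy configured, the request is allowed. Otherwise it checks whether the resource is one that contains a PodSpec, such as PodTemplate, ReplicationController, ReplicaSet, Deployment, DaemonSet, StatefulSet, Job, or CronJob, extracts the PodSpec, and evaluates it against the policy.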

4.2 Initializing Pod Security Admission

Like most Go programs, Pod Security Admission builds its startup command with github.com/spf13/cobra, and on startup calls runServer to initialize and start the webhook server. The Options argument carries defaults such as DefaultClientQPSLimit, DefaultClientQPSBurst, DefaultPort, and DefaultInsecurePort.

// NewServerCommand creates a *cobra.Command object with default parameters.
func NewServerCommand() *cobra.Command {
    opts := options.NewOptions()

    cmdName := "podsecurity-webhook"
    if executable, err := os.Executable(); err == nil {
        cmdName = filepath.Base(executable)
    }
    cmd := &cobra.Command{
        Use: cmdName,
        Long: `The PodSecurity webhook is a standalone webhook server implementing the PodSecurity Standards.`,
        RunE: func(cmd *cobra.Command, _ []string) error {
            verflag.PrintAndExitIfRequested()
            // Initialize and start the webhook server.
            return runServer(cmd.Context(), opts)
        },
        Args: cobra.NoArgs,
    }
    opts.AddFlags(cmd.Flags())
    verflag.AddFlags(cmd.Flags())

    return cmd
}

The runServer function loads the admission controller's configuration, initializes the server, and finally starts it.

func runServer(ctx context.Context, opts *options.Options) error {
    // Load the configuration.
    config, err := LoadConfig(opts)
    if err != nil {
        return err
    }
    // Initialize the server from the configuration.
    server, err := Setup(config)
    if err != nil {
        return err
    }

    // Cancel the context when a termination signal arrives.
    ctx, cancel := context.WithCancel(ctx)
    defer cancel()
    go func() {
        stopCh := apiserver.SetupSignalHandler()
        <-stopCh
        cancel()
    }()
    // Start the server.
    return server.Start(ctx)
}

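The excerpt below shows the main part of the Setup function. Setup creates an Admission object that contains:

Configuration (from PodSecurityConfig): the admission controller's configuration, including the default policy level, mode, and pinned Kubernetes version, as well as the exempted Usernames, RuntimeClasses, and Namespaces.
Evaluator: the evaluator, which defines the concrete methods used to check the security policies.
Metrics: used to collect Prometheus metrics.
PodSpecExtractor: used to extract the PodSpec from the request object.
PodLister: used to list the Pods in a given namespace.
NamespaceGetter: used to get the namespace of the resource in the intercepted request.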

// Setup creates an Admission object to handle the admission logic.
func Setup(c *Config) (*Server, error) {
    ...
    s.delegate = &admission.Admission{
        Configuration:    c.PodSecurityConfig,
        Evaluator:        evaluator,
        Metrics:          metrics,
        PodSpecExtractor: admission.DefaultPodSpecExtractor{},
        PodLister:        admission.PodListerFromClient(client),
        NamespaceGetter:  admission.NamespaceGetterFromListerAndClient(namespaceLister, client),
    }
    ...
    return s, nil
}

Once the admission controller server is running, it registers the HandleValidate method to handle admission validation. HandleValidate in turn calls Validate to perform the actual policy checks.

// HandleValidate handles requests intercepted by the webhook.
func (s *Server) HandleValidate(w http.ResponseWriter, r *http.Request) {
    defer utilruntime.HandleCrash(func(_ interface{}) {
        // Assume the crash happened before the response was written.
        http.Error(w, "internal server error", http.StatusInternalServerError)
    })
    ...
    // Run the actual validation.
    response := s.delegate.Validate(ctx, attributes)
    response.UID = review.Request.UID // Response UID must match request UID
    review.Response = response
    writeResponse(w, review)
}

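4.3 Admission Validation Logic

Validate calls a different validation method depending on the resource type in the request. All three paths below eventually call EvaluatePod to evaluate the Pod against the Pod Security Standards.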

// Validate admits an API request.
// The objects in admission attributes are expected to be external v1 objects that we care about.
// The returned response may be shared and must not be mutated.
func (a *Admission) Validate(ctx context.Context, attrs api.Attributes) *admissionv1.AdmissionResponse {
    var response *admissionv1.AdmissionResponse
    switch attrs.GetResource().GroupResource() {
    case namespacesResource:
        response = a.ValidateNamespace(ctx, attrs)
    case podsResource:
        response = a.ValidatePod(ctx, attrs)
    default:
        response = a.ValidatePodController(ctx, attrs)
    }
    return response
}

EvaluatePod looks at the policy level and version configured on the namespace and selects the corresponding set of checks to run against the Pod.

func (r *checkRegistry) EvaluatePod(lv api.LevelVersion, podMetadata *metav1.ObjectMeta, podSpec *corev1.PodSpec) []CheckResult {
    // If the configured level is Privileged (the permissive policy), return immediately.
    if lv.Level == api.LevelPrivileged {
        return nil
    }
    // If the newest registered check version is older than the version set on the
    // namespace, fall back to the newest registered version.
    if r.maxVersion.Older(lv.Version) {
        lv.Version = r.maxVersion
    }
    var checks []CheckPodFn
    // The configured level is Baseline.
    if lv.Level == api.LevelBaseline {
        checks = r.baselineChecks[lv.Version]
    } else {
        // includes non-overridden baseline checks
        // Otherwise run the Restricted checks.
        checks = r.restrictedChecks[lv.Version]
    }
    var results []CheckResult
    // Run every check and collect the results.
    for _, check := range checks {
        results = append(results, check(podMetadata, podSpec))
    }
    return results
}
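The selection logic in EvaluatePod can be tried out with a toy registry. The following Go sketch is a simplified imitation, not the upstream types: versions are plain minor numbers (23 stands for v1.23), a Pod is reduced to a map of boolean flags, and registry, checkFn, and evaluate are hypothetical names. It also registers only one version, so it does not model how real checks are carried across versions.

```go
package main

import "fmt"

// checkFn is a simplified check: it returns "" when the Pod passes.
type checkFn func(pod map[string]bool) string

// registry keys checks by Kubernetes minor version (23 stands for v1.23).
type registry struct {
	maxMinor   int
	baseline   map[int][]checkFn
	restricted map[int][]checkFn
}

// evaluate follows the same steps as EvaluatePod: skip everything for
// privileged, clamp the version to the newest registered one, pick the
// check list for the level, and run every check.
func (r *registry) evaluate(level string, minor int, pod map[string]bool) []string {
	if level == "privileged" {
		return nil
	}
	if minor > r.maxMinor {
		minor = r.maxMinor
	}
	checks := r.restricted[minor]
	if level == "baseline" {
		checks = r.baseline[minor]
	}
	var failures []string
	for _, check := range checks {
		if msg := check(pod); msg != "" {
			failures = append(failures, msg)
		}
	}
	return failures
}

func main() {
	hostPID := func(pod map[string]bool) string {
		if pod["hostPID"] {
			return "host namespaces (hostPID=true)"
		}
		return ""
	}
	nonRoot := func(pod map[string]bool) string {
		if !pod["runAsNonRoot"] {
			return "runAsNonRoot != true"
		}
		return ""
	}
	r := &registry{
		maxMinor:   23,
		baseline:   map[int][]checkFn{23: {hostPID}},
		restricted: map[int][]checkFn{23: {hostPID, nonRoot}},
	}

	pod := map[string]bool{"hostPID": false, "runAsNonRoot": false}
	fmt.Println(r.evaluate("baseline", 23, pod))   // passes baseline
	fmt.Println(r.evaluate("restricted", 99, pod)) // version 99 is clamped to 23
}
```

Clamping to the newest registered version is what lets a namespace ask for a policy version newer than the controller knows about and still be evaluated with the most recent checks available.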

上面截取一个具体的测验办法来看一下是如何进行pod平安规范查看的，如下查看了Pod中的容器是否敞开了allowPrivilegeEscalation，AllowPrivilegeEscalation设置容器内的子过程是否能够晋升权限，通常在设置非root用户（MustRunAsNonRoot）时进行设置。

func allowPrivilegeEscalation_1_8(podMetadata *metav1.ObjectMeta, podSpec *corev1.PodSpec) CheckResult {    var badContainers []string    visitContainers(podSpec, func(container *corev1.Container) {        // 查看pod中容器平安上下文是否配置，AllowPrivilegeEscalation是否配置，及AllowPrivilegeEscalation是否设置为false.        if container.SecurityContext == nil || container.SecurityContext.AllowPrivilegeEscalation == nil || *container.SecurityContext.AllowPrivilegeEscalation {            badContainers = append(badContainers, container.Name)        }    })    if len(badContainers) > 0 {        // 存在违反Pod平安规范策略的内容，则返回具体后果信息        return CheckResult{            Allowed:         false,            ForbiddenReason: "allowPrivilegeEscalation != false",            ForbiddenDetail: fmt.Sprintf(                "%s %s must set securityContext.allowPrivilegeEscalation=false",                pluralize("container", "containers", len(badContainers)),                joinQuote(badContainers),            ),        }    }    return CheckResult{Allowed: true}}
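The condition used by this check can be exercised in isolation. The Go sketch below uses simplified stand-ins for the corev1 types (only the fields the check reads), and badContainers is a hypothetical helper; what it illustrates is that leaving allowPrivilegeEscalation unset counts as a violation, just like setting it to true.

```go
package main

import (
	"fmt"
	"strings"
)

// SecurityContext and Container are simplified stand-ins for the corev1
// types, keeping only the fields this check reads.
type SecurityContext struct {
	AllowPrivilegeEscalation *bool
}

type Container struct {
	Name            string
	SecurityContext *SecurityContext
}

// badContainers returns the names of containers that do not explicitly set
// allowPrivilegeEscalation to false.
func badContainers(containers []Container) []string {
	var bad []string
	for _, c := range containers {
		sc := c.SecurityContext
		if sc == nil || sc.AllowPrivilegeEscalation == nil || *sc.AllowPrivilegeEscalation {
			bad = append(bad, c.Name)
		}
	}
	return bad
}

func main() {
	no, yes := false, true
	containers := []Container{
		{Name: "unset"},
		{Name: "explicit-true", SecurityContext: &SecurityContext{AllowPrivilegeEscalation: &yes}},
		{Name: "explicit-false", SecurityContext: &SecurityContext{AllowPrivilegeEscalation: &no}},
	}
	// Prints: unset, explicit-true
	fmt.Println(strings.Join(badContainers(containers), ", "))
}
```

Because the field is a pointer, the check can tell "not set" apart from "set to false", and only an explicit false passes.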

Summary

In Kubernetes v1.23, Pod Security Admission has been promoted to beta. Its feature set is still limited, but it is a feature with a promising future.