OpenKruise: Freeing DaemonSet Operations

Author | 王思宇 (酒祝)

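OpenKruise is an open-source engine from Alibaba Cloud for automated management of large-scale applications. Functionally it is benchmarked against Kubernetes' native controllers such as Deployment and StatefulSet, but it provides more enhanced capabilities, for example graceful in-place upgrades, release priority and scatter strategies, abstract management of workloads across multiple availability zones, and unified sidecar container injection. All of these are core capabilities honed in Alibaba's ultra-large-scale application scenarios. These features help us handle more diverse deployment environments and requirements, and give cluster maintainers and application developers more flexible combinations of deployment and release strategies.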

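In Alibaba's internal cloud-native environment, all applications now use OpenKruise for Pod deployment and release management. Many companies in the industry and customers on Alibaba Cloud have also adopted OpenKruise as their deployment vehicle, because native K8s workloads such as Deployment cannot fully meet their needs. We hope OpenKruise lets every Kubernetes developer and every Alibaba Cloud user conveniently use the same deployment and release capabilities that Alibaba's internal cloud-native applications rely on.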

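How do you deploy node components in a Kubernetes cluster? Most readers will be familiar with DaemonSet: it deploys a defined Pod to every eligible Node, which greatly eases the pain we used to have maintaining the various daemon processes on nodes.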

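In Alibaba's internal cloud-native environment, many node components for networking, storage, GPU, monitoring, and so on are deployed and managed through DaemonSet. However, as Kubernetes clusters have grown over the past two years and all core businesses have gradually moved fully onto cloud native, we have increasingly felt that native DaemonSet struggles to meet the needs of large-scale, high-availability, complex scenarios.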

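Think of it this way: native DaemonSet does solve the 0 -> 1 problem. It spares us from directly managing the software packages and daemon processes on each Node, and lets us deploy node components as uniform Pods. But what happens after deployment? We then face the 1 -> N problem of continuous iterative upgrades, and when it comes to upgrade capability, native DaemonSet feels rather perfunctory.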

apiVersion: apps/v1
kind: DaemonSet
spec:
  updateStrategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 2
  # ...

apiVersion: apps/v1
kind: DaemonSet
spec:
  updateStrategy:
    type: OnDelete
  # ...

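These are the two upgrade strategies that native DaemonSet supports. Most people use the default RollingUpdate, which is fine in itself. The problem is that rolling update supports only one policy, maxUnavailable, and that is hard for us to accept. Many of Alibaba's Kubernetes clusters now exceed ten thousand nodes in a single cluster. These nodes may differ in machine type, topology, criticality, kernel version, and so on, and a DaemonSet upgrade covers the daemon Pods on all of those nodes and affects the application Pods on every one of them.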

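Faced with an environment this complex and this large, native DaemonSet has no canary release, no batching, no pause, and no priority; a single maxUnavailable policy clearly cannot suffice. Bear in mind that even when a daemon Pod configures a readinessProbe, the probe can usually only check whether the process inside the container has started and is running. It can hardly assess how well that process is actually working.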

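So even if a DaemonSet rolls out a version with buggy code, as long as the process starts normally the maxUnavailable policy offers no protection and the DaemonSet keeps rolling out. If the problem is discovered only some time after the upgrade began, the blast radius may by then already cover the entire cluster.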

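To avoid this, we once switched to the OnDelete strategy and controlled release order and batching from the release platform. In the end state, though, we still want workload capabilities to sink back into the workload itself, forming a closed loop rather than scattering one complete capability across multiple modules. So as OpenKruise matured and spread inside and outside Alibaba, we distilled our internal, general-purpose DaemonSet release requirements and built them into OpenKruise under the name Advanced DaemonSet.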

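Today most DaemonSets inside Alibaba and Ant Group have been unified under Advanced DaemonSet for deployment and management. Since the release of OpenKruise v0.6.0, some external companies, such as Israel-based Bringg, have also started adopting it.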

The main API fields added in Advanced DaemonSet are as follows:

const (
+    // StandardRollingUpdateType replace the old daemons by new ones using rolling update i.e replace them on each node one after the other.
+    // this is the default type for RollingUpdate.
+    StandardRollingUpdateType RollingUpdateType = "Standard"

+    // SurgingRollingUpdateType replaces the old daemons by new ones using rolling update i.e replace them on each node one
+    // after the other, creating the new pod and then killing the old one.
+    SurgingRollingUpdateType RollingUpdateType = "Surging"
)

// Spec to control the desired behavior of daemon set rolling update.
type RollingUpdateDaemonSet struct {
+    // Type is to specify which kind of rollingUpdate.
+    Type RollingUpdateType `json:"rollingUpdateType,omitempty" protobuf:"bytes,1,opt,name=rollingUpdateType"`

    // ...
    MaxUnavailable *intstr.IntOrString `json:"maxUnavailable,omitempty" protobuf:"bytes,2,opt,name=maxUnavailable"`

+    // A label query over nodes that are managed by the daemon set RollingUpdate.
+    // Must match in order to be controlled.
+    // It must match the node's labels.
+    Selector *metav1.LabelSelector `json:"selector,omitempty" protobuf:"bytes,3,opt,name=selector"`

+    // The number of DaemonSet pods remained to be old version.
+    // Default value is 0.
+    // Maximum value is status.DesiredNumberScheduled, which means no pod will be updated.
+    // +optional
+    Partition *int32 `json:"partition,omitempty" protobuf:"varint,4,opt,name=partition"`

+    // Indicates that the daemon set is paused and will not be processed by the
+    // daemon set controller.
+    // +optional
+    Paused *bool `json:"paused,omitempty" protobuf:"varint,5,opt,name=paused"`

+    // Only when type=SurgingRollingUpdateType, it works.
+    // The maximum number of DaemonSet pods that can be scheduled above the desired number of pods
+    // during the update. Value can be an absolute number (ex: 5) or a percentage of the total number
+    // of DaemonSet pods at the start of the update (ex: 10%). The absolute number is calculated from
+    // the percentage by rounding up. This cannot be 0. The default value is 1. Example: when this is
+    // set to 30%, at most 30% of the total number of nodes that should be running the daemon pod
+    // (i.e. status.desiredNumberScheduled) can have 2 pods running at any given time. The update
+    // starts by starting replacements for at most 30% of those DaemonSet pods. Once the new pods are
+    // available it then stops the existing pods before proceeding onto other DaemonSet pods, thus
+    // ensuring that at most 130% of the desired final number of DaemonSet  pods are running at all
+    // times during the update.
+    // +optional
+    MaxSurge *intstr.IntOrString `json:"maxSurge,omitempty" protobuf:"bytes,7,opt,name=maxSurge"`
}

type DaemonSetSpec struct {
    // ...

+    // BurstReplicas is a rate limiter for booting pods on a lot of pods.
+    // The default value is 250
+    BurstReplicas *intstr.IntOrString `json:"burstReplicas,omitempty" protobuf:"bytes,5,opt,name=burstReplicas"`
}

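A large Kubernetes cluster usually contains many differentiated node types, for example by machine type, topology, criticality, or kernel version. When releasing a DaemonSet, we therefore support using Node labels to select which Nodes' Pods get released.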

apiVersion: apps.kruise.io/v1alpha1
kind: DaemonSet
spec:
  # ...
  updateStrategy:
    type: RollingUpdate
    rollingUpdate:
      selector:
        matchLabels:
          nodeType: canary

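With the selector policy above configured under rolling update, the DaemonSet rolls Pods only on Nodes that match the selector. If the selector changes, the DaemonSet upgrades according to the new selector and leaves Pods that are already on the latest version untouched.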

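Users can therefore control the release order across different node types by modifying the selector several times. The first group might be a batch of non-core nodes reserved for canary testing, or a logical resource pool.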
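
To make the matchLabels semantics concrete, here is a minimal Python sketch of selecting nodes by label. It is not the controller's code: the node names, the extra `zone` label, and the helper `matches` are hypothetical and exist only for illustration.

```python
def matches(match_labels, node_labels):
    """Return True when every key/value in match_labels is present on the node."""
    return all(node_labels.get(key) == value for key, value in match_labels.items())


# Hypothetical nodes and labels, for illustration only.
nodes = {
    "node-a": {"nodeType": "canary", "zone": "z1"},
    "node-b": {"nodeType": "core", "zone": "z1"},
    "node-c": {"nodeType": "canary", "zone": "z2"},
}

selector = {"nodeType": "canary"}

# Only Pods on these nodes would be considered for the rolling update.
eligible = sorted(name for name, labels in nodes.items() if matches(selector, labels))
print(eligible)
```

Changing `selector` to `{"nodeType": "core"}` in a later step would bring `node-b` into scope, which is the staged ordering the paragraph above describes.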

If you don't care about node type, Advanced DaemonSet also offers canary release by count:

apiVersion: apps.kruise.io/v1alpha1
kind: DaemonSet
spec:
  # ...
  updateStrategy:
    type: RollingUpdate
    rollingUpdate:
      partition: 100

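The partition here is similar to that of CloneSet and Advanced StatefulSet in OpenKruise: it is the number of Pods kept on the old version. In other words, the Kruise controller selects status.DesiredNumberScheduled - partition Pods and rolls them to the new version.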

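For example, if the DaemonSet currently runs on 120 nodes and partition is set to 100 for a rolling update, the DaemonSet rolls only 20 Pods to the new version. Only when the user lowers partition again does the DaemonSet continue upgrading, up to the newly required count.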
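
The arithmetic can be sketched in a few lines of Python. The function name `pods_to_update` is hypothetical, and clamping partition to the range 0..DesiredNumberScheduled is an assumption based on the API comment that the maximum value means no Pod is updated.

```python
def pods_to_update(desired_number_scheduled, partition):
    """Number of Pods allowed onto the new version for a given partition."""
    # Assumption: out-of-range partition values are clamped rather than rejected.
    partition = max(0, min(partition, desired_number_scheduled))
    return desired_number_scheduled - partition


# Lowering partition step by step widens the canary: 100 -> 50 -> 0.
for partition in (100, 50, 0):
    print(partition, pods_to_update(120, partition))
```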

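Neither of the two canary strategies above is hard to understand. But what happens if the by-node and by-count strategies are configured together?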

apiVersion: apps.kruise.io/v1alpha1
kind: DaemonSet
spec:
  # ...
  updateStrategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 5
      partition: 100
      selector:
        matchLabels:
          nodeType: canary

This becomes easy to understand once you read Advanced DaemonSet's release-strategy calculation logic. Interested readers can take a look: https://github.com/openkruise/kruise/blob/master/pkg/controller/daemonset/update.go#L459

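Referring to the YAML above: if the user configures both partition and selector, the controller first matches eligible Nodes by selector and then uses partition to compute how many of them may be released. And if you also configure maxUnavailable, which native DaemonSet already supports, the number that can actually be rolled is limited once more by the count of unavailable Pods.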

In short, the Pods that actually get rolled must satisfy every configured canary strategy at the same time.
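
The following Python sketch shows one way the three limits could combine. It is a simplified model under stated assumptions, not the logic in update.go: the function `plan_update`, the node dictionaries, and the way the partition and unavailable budgets are computed are all hypothetical.

```python
def matches(match_labels, node_labels):
    """Return True when every key/value in match_labels is present on the node."""
    return all(node_labels.get(key) == value for key, value in match_labels.items())


def plan_update(nodes, selector, partition, max_unavailable):
    """Pick the nodes whose old Pod may be rolled in this round (simplified model)."""
    # 1. Selector: only old Pods on matching nodes are candidates.
    candidates = [n for n in nodes if not n["updated"] and matches(selector, n["labels"])]
    # 2. Partition: keep `partition` Pods on the old version (assumed budget formula).
    already_updated = sum(1 for n in nodes if n["updated"])
    partition_budget = max(0, len(nodes) - partition - already_updated)
    # 3. maxUnavailable: do not exceed the allowed number of unavailable Pods.
    unavailable = sum(1 for n in nodes if not n["available"])
    unavailable_budget = max(0, max_unavailable - unavailable)
    limit = min(len(candidates), partition_budget, unavailable_budget)
    return [n["name"] for n in candidates[:limit]]


# Hypothetical cluster: 120 nodes, the first 30 labelled as canary.
nodes = [
    {
        "name": f"node-{i}",
        "labels": {"nodeType": "canary" if i < 30 else "core"},
        "updated": False,
        "available": True,
    }
    for i in range(120)
]

plan = plan_update(nodes, {"nodeType": "canary"}, partition=100, max_unavailable=5)
print(plan)
```

In this model the selector leaves 30 candidates, partition allows 20 of them, and maxUnavailable narrows the current round to 5, so the tightest limit wins.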

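The standard DaemonSet rolling upgrade deletes the old Pod first and then creates the new one. This works for the vast majority of scenarios, but if the daemon Pod also serves external traffic, the service on that Node may be unavailable while the Pod is being rolled.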

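To provide high availability, we also offer a surging release strategy for DaemonSet. (Recall native Deployment or OpenKruise's CloneSet: in these workloads for stateless services, if maxSurge is configured, a release first scales out maxSurge extra Pods and then gradually deletes the old-version Pods.)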

apiVersion: apps.kruise.io/v1alpha1
kind: DaemonSet
spec:
  # ...
  updateStrategy:
    rollingUpdate:
      rollingUpdateType: Surging  # defaults to Standard
      maxSurge: 30%

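First, set the rolling update type to Surging. The type defaults to Standard, which deletes first and then creates; once it is set to Surging, the order becomes create first, then delete. That is, during a rolling upgrade the DaemonSet first creates a new Pod on the Node being released, waits for that new-version Pod to become ready, and only then deletes the old-version Pod.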

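As for throttling, maxUnavailable applies to the Standard type and is the maximum number of Nodes on which Pods may be deleted at once during a rolling upgrade. maxSurge applies to the Surging type and is the maximum number of Nodes on which one extra Pod may be created at once during a rolling upgrade.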
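
The API comment on MaxSurge says a percentage is converted to an absolute number by rounding up. A minimal Python sketch of that conversion follows; the helper name `resolve_int_or_percent` is hypothetical and only mirrors the behaviour described in the comment.

```python
def resolve_int_or_percent(value, total):
    """Resolve an absolute number or a percentage string such as "30%", rounding up."""
    if isinstance(value, int):
        return value
    if isinstance(value, str) and value.endswith("%"):
        percent = int(value[:-1])
        # Integer ceiling division avoids floating-point rounding surprises.
        return (percent * total + 99) // 100
    raise ValueError(f"expected an int or a percentage string, got {value!r}")


desired = 120  # stands in for status.desiredNumberScheduled
surge = resolve_int_or_percent("30%", desired)
peak_pods = desired + surge
print(surge, peak_pods)
```

With 120 desired Pods and maxSurge of 30%, at most 36 Nodes run two Pods at a time, so the peak is 156 Pods, which is the 130% ceiling mentioned in the API comment.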

In addition, Advanced DaemonSet supports paused, which pauses a release in one step. This is easy to understand, so we won't elaborate.

apiVersion: apps.kruise.io/v1alpha1
kind: DaemonSet
spec:
  # ...
  updateStrategy:
    rollingUpdate:
      paused: true

Overall, OpenKruise adds a set of production-oriented release strategies on top of native DaemonSet, making DaemonSet upgrades safer, more controllable, and more automated.

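Going forward, OpenKruise will keep deepening its application deployment and release capabilities, and we welcome every cloud-native enthusiast to help build it. Unlike some other open-source projects, OpenKruise is not a copy of Alibaba's internal code. On the contrary, the OpenKruise GitHub repository is the upstream of Alibaba's internal code base. Every line of code you contribute will therefore run in all of Alibaba's internal Kubernetes clusters and help support Alibaba's cloud-native application scenarios, which are among the largest in the world.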

