Kubernetes Failover on Node Failure: A Source Code Analysis


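K8s has handled Pod migration after a kubelet failure differently from version to version, so articles online do not agree on how to shorten the migration time, and some of the advice you find has no effect at all.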

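Early on I found articles that pointed to one key parameter, pod-eviction-timeout, the time to wait before evicting Pods. Changing it had no effect, though, and reading the source showed that the parameter was never used, which made me suspect it was deprecated. After going through a lot more material, I found that different versions use different eviction logic.

Versions below 1.13: when the taint manager feature is not enabled, Pod migration is governed by the following four parameters:
- node-status-update-frequency: how often the node reports its status, default 10s
- node-monitor-period: how often the node controller checks node status, default 5s
- node-monitor-grace-period: how long the node controller waits without a heartbeat before marking the Node NotReady, default 40s
- pod-eviction-timeout: how long the node controller waits before it starts evicting Pods
Versions from 1.14 up to, but not including, 1.18: the taint manager feature is enabled by default, and Pods are evicted through the taint manager mechanism.
Versions above 1.18: the taint manager is mandatory, so the old code path no longer has any effect.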

From the official documentation

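Node affinity is a property of Pods that attracts them to a set of nodes (either as a preference or as a hard requirement).

Taints are the opposite: they allow a node to repel a set of Pods.

Tolerations are applied to Pods. A toleration allows the scheduler to place the Pod on a node that carries a matching taint. Tolerations allow scheduling but do not guarantee it: the scheduler also evaluates other parameters as part of its function.

Taints and tolerations work together to keep Pods off nodes that are not suitable for them. One or more taints can be applied to a node, which means the node will not accept any Pod that does not tolerate those taints.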

In short, under the taint and toleration model, every kind of Pod eviction can go through this one mechanism, including eviction caused by a kubelet failure.
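
The matching rule between a toleration and a taint can be sketched as follows. This is a minimal illustration with simplified stand-in types, not the Kubernetes API types themselves; the rule it encodes (empty effect matches any effect, empty key with Exists matches any key) is my reading of how matching works.

```go
package main

import "fmt"

// Taint and Toleration are simplified stand-ins for the Kubernetes API types.
type Taint struct {
	Key, Value, Effect string
}

type Toleration struct {
	Key      string // empty matches every key (only meaningful with "Exists")
	Operator string // "Exists" or "Equal"; empty is treated as "Equal"
	Value    string
	Effect   string // empty matches every effect
}

// Tolerates reports whether this toleration matches the given taint.
func (t Toleration) Tolerates(taint Taint) bool {
	if t.Effect != "" && t.Effect != taint.Effect {
		return false
	}
	if t.Key != "" && t.Key != taint.Key {
		return false
	}
	switch t.Operator {
	case "Exists":
		return true
	case "", "Equal":
		return t.Value == taint.Value
	default:
		return false
	}
}

func main() {
	notReady := Taint{Key: "node.kubernetes.io/not-ready", Effect: "NoExecute"}
	matching := Toleration{Key: "node.kubernetes.io/not-ready", Operator: "Exists", Effect: "NoExecute"}
	other := Toleration{Key: "example.com/other", Operator: "Exists"}

	fmt.Println(matching.Tolerates(notReady)) // true
	fmt.Println(other.Tolerates(notReady))    // false
}
```

A Pod with no matching toleration for a NoExecute taint is evicted from the node, which is exactly the path a failed kubelet ends up triggering.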

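The node controller periodically checks the state of every node. If a node's last heartbeat is older than node-monitor-grace-period, the node is considered unreachable and is given a taint.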

# node_lifecycle_controller.go
monitorNodeHealth()
    // 1. List all nodes
    --> nodes, err := nc.nodeLister.List(labels.Everything())
    // 2. Use the heartbeat time to decide whether the node has become NotReady
    --> gracePeriod, observedReadyCondition, currentReadyCondition, err = nc.tryUpdateNodeHealth(node)
    // 3. Set the taint on the node
    --> nc.processTaintBaseEviction(node, &observedReadyCondition)

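Once the Node has been tainted, the nodeInformer registered in NodeLifecycleController observes the update event and passes the node object to tc.nodeUpdateChannels.

tc is the NoExecuteTaintManager, the taint manager object. It listens on tc.nodeUpdateChannels, hands the node to tc.handleNodeUpdate, then looks up all Pods on that node and calls tc.processPodOnNode for each of them.

processPodOnNode creates a TimedWorker, an object that runs work on a timer: when the time is up, it calls the given handler, deletePodHandler, which evicts the Pod.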

So how long is the TimedWorker's timer? The taint manager computes a minTolerationTime, the minimum toleration time, from the toleration times it finds in pod.Spec.Tolerations.

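And when is that toleration time written into the Pod?

Running kubectl describe pod xxx shows that the Pod already carries tolerations for the taints node.kubernetes.io/not-ready:NoExecute and node.kubernetes.io/unreachable:NoExecute, each with a toleration time of 300s.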

Tolerations:                 node-role.kubernetes.io/master:NoSchedule
                             node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s

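After some digging, it turns out that these default eviction tolerations are set on the Pod by the API server.

Kubernetes automatically adds tolerations for node.kubernetes.io/not-ready and node.kubernetes.io/unreachable to every Pod, with tolerationSeconds=300, unless the user or a controller sets these tolerations explicitly.

These automatically added tolerations mean that a Pod stays bound to its node for 5 minutes after one of these problems is detected.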

kube-apiserver flags (excerpt)

--default-not-ready-toleration-seconds int Default: 300

Indicates the tolerationSeconds of the toleration for notReady:NoExecute that is added by default to every pod that does not already have such a toleration.

--default-unreachable-toleration-seconds int Default: 300

Indicates the tolerationSeconds of the toleration for unreachable:NoExecute that is added by default to every pod that does not already have such a toleration.

plugin/pkg/admission/defaulttolerationseconds/admission.go:43

var (
    defaultNotReadyTolerationSeconds = flag.Int64("default-not-ready-toleration-seconds", 300,
        "Indicates the tolerationSeconds of the toleration for notReady:NoExecute"+
            " that is added by default to every pod that does not already have such a toleration.")

    defaultUnreachableTolerationSeconds = flag.Int64("default-unreachable-toleration-seconds", 300,
        "Indicates the tolerationSeconds of the toleration for unreachable:NoExecute"+
            " that is added by default to every pod that does not already have such a toleration.")

    notReadyToleration = api.Toleration{
        Key:               v1.TaintNodeNotReady,
        Operator:          api.TolerationOpExists,
        Effect:            api.TaintEffectNoExecute,
        TolerationSeconds: defaultNotReadyTolerationSeconds,
    }

    unreachableToleration = api.Toleration{
        Key:               v1.TaintNodeUnreachable,
        Operator:          api.TolerationOpExists,
        Effect:            api.TaintEffectNoExecute,
        TolerationSeconds: defaultUnreachableTolerationSeconds,
    }
)

// Admit makes an admission decision based on the request attributes
func (p *Plugin) Admit(ctx context.Context, attributes admission.Attributes, o admission.ObjectInterfaces) (err error) {
    // ......
    if !toleratesNodeNotReady {
        pod.Spec.Tolerations = append(pod.Spec.Tolerations, notReadyToleration)
    }

    if !toleratesNodeUnreachable {
        pod.Spec.Tolerations = append(pod.Spec.Tolerations, unreachableToleration)
    }
    // ......
}

This appears to be logic in the API server's admission step: by default it adds these tolerations to every Pod.
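
Since the excerpt above elides how toleratesNodeNotReady and toleratesNodeUnreachable are computed, here is a self-contained sketch of what such a defaulting step could look like. It is my own approximation under stated assumptions, not the plugin's code: the types are simplified, the function names are made up, a single seconds value is used for both taints, and a toleration is assumed to "cover" a taint when its key is equal or empty and its effect is NoExecute or empty.

```go
package main

import "fmt"

const (
	taintNotReady    = "node.kubernetes.io/not-ready"
	taintUnreachable = "node.kubernetes.io/unreachable"
	effectNoExecute  = "NoExecute"
)

// Toleration is a simplified stand-in for the API type.
type Toleration struct {
	Key               string
	Operator          string
	Effect            string
	TolerationSeconds *int64
}

// covers reports whether t already addresses the NoExecute taint with this key.
func covers(t Toleration, key string) bool {
	keyMatches := t.Key == key || t.Key == ""
	effectMatches := t.Effect == effectNoExecute || t.Effect == ""
	return keyMatches && effectMatches
}

// defaultToleration builds an Exists/NoExecute toleration with its own copy
// of the seconds value.
func defaultToleration(key string, seconds int64) Toleration {
	return Toleration{Key: key, Operator: "Exists", Effect: effectNoExecute, TolerationSeconds: &seconds}
}

// AddDefaultTolerations appends the two defaults unless the Pod already has
// a toleration covering the respective taint.
func AddDefaultTolerations(tolerations []Toleration, seconds int64) []Toleration {
	hasNotReady, hasUnreachable := false, false
	for _, t := range tolerations {
		hasNotReady = hasNotReady || covers(t, taintNotReady)
		hasUnreachable = hasUnreachable || covers(t, taintUnreachable)
	}
	if !hasNotReady {
		tolerations = append(tolerations, defaultToleration(taintNotReady, seconds))
	}
	if !hasUnreachable {
		tolerations = append(tolerations, defaultToleration(taintUnreachable, seconds))
	}
	return tolerations
}

func main() {
	for _, t := range AddDefaultTolerations(nil, 300) {
		fmt.Printf("%s:%s op=%s for %ds\n", t.Key, t.Effect, t.Operator, *t.TolerationSeconds)
	}
}
```

The practical consequence is that a Pod which sets its own, shorter tolerationSeconds for these two taints keeps that value, and is evicted sooner than the 300s default.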

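After going through the documentation and the source, I finally have a clear picture of how Pod eviction on node failure is implemented. The design is intricate and hard to trace: from the kubelet heartbeat, to the node controller in controller-manager watching for it, to the default tolerations that the API server adds to Pods, plus the scheduler no longer placing Pods on the affected node, it touches practically every control plane component.

The code also makes heavy use of channels and queues, so the components are thoroughly decoupled, but that also makes the source considerably harder to read. The community keeps optimizing and refactoring the code with each release, and working out how k8s is implemented is an interesting and challenging task.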



2022-08-15
