Troubleshooting an Unhealthy Kubelet PLEG

jiezi

6 years ago

Environment: a K8S cluster managed by Rancher.
Symptom: one Node frequently reports the error "PLEG is not healthy: pleg was last seen active 3m46.752815514s ago; threshold is 3m0s", recurring roughly every 5 to 10 minutes.
Troubleshooting:

kubectl get pods --all-namespaces shows that a Pod named istio-ingressgateway-6bbdd58f8c-nlgnd is stuck in the Terminating state; in other words, it cannot be killed.
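On a cluster with many Pods, scanning that output by eye is tedious. The following is a minimal sketch of a filter that keeps only the Pods stuck in Terminating. The function name terminating_pods is hypothetical, and the sketch assumes the default column order of kubectl get pods --all-namespaces (NAMESPACE, NAME, READY, STATUS, RESTARTS, AGE).

```shell
#!/bin/sh
# Hypothetical helper (not from the original post).
# Reads `kubectl get pods --all-namespaces` output on stdin and prints
# "namespace/name" for every Pod whose STATUS column is Terminating.
# Assumes the default column order:
#   NAMESPACE NAME READY STATUS RESTARTS AGE
terminating_pods() {
    # NR > 1 skips the header row; $4 is the STATUS column.
    awk 'NR > 1 && $4 == "Terminating" { print $1 "/" $2 }'
}

# Usage, on a machine with access to the cluster:
#   kubectl get pods --all-namespaces | terminating_pods
```

Reading from stdin keeps the helper independent of how the Pod list was obtained, so it also works on saved output.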

On the Node, docker logs --tail 100 kubelet also shows that this Pod is in an abnormal state:
I0218 01:21:17.383650 10311 kubelet.go:1775] skipping pod synchronization - [PLEG is not healthy: pleg was last seen active 3m46.752815514s ago; threshold is 3m0s]
…
E0218 01:21:30.654433 10311 generic.go:271] PLEG: pod istio-ingressgateway-6bbdd58f8c-nlgnd/istio-system failed reinspection: rpc error: code = DeadlineExceeded desc = context deadline exceeded
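
To see which Pods the PLEG keeps failing on, the "failed reinspection" lines can be summarized. This is a rough sketch that assumes the log format shown above; the function name pleg_failed_pods is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper (not from the original post).
# Reads kubelet log lines on stdin and prints, for each Pod named in a
# "PLEG: pod <name>/<namespace> failed reinspection" line, how many
# times it appears, most frequent first.
pleg_failed_pods() {
    # sed keeps only the pod identifier from matching lines;
    # uniq -c counts occurrences; awk trims the count padding.
    sed -n 's/.*PLEG: pod \([^ ]*\) failed reinspection.*/\1/p' |
        sort | uniq -c | sort -rn | awk '{ print $1, $2 }'
}

# Usage on the Node (kubelet logs go to stderr, hence 2>&1):
#   docker logs --tail 1000 kubelet 2>&1 | pleg_failed_pods
```

A Pod that dominates this list is a good first suspect for the stalled relist.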

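Tried to delete the Pod with kubectl delete pod; the command hung.
Force-deleted the Pod with kubectl delete pod --force --grace-period=0.
Back on the Node, checked whether the container had really been stopped: docker ps -a | grep ingressgateway-6bbdd58f8c-nlgnd showed the container in the Exited state.
Watched the Node status: the problem persisted.
Deleted the Deployment that owns the Pod, then removed the Pod that had been stuck in Terminating with kubectl delete pod --force --grace-period=0.
Redeployed the Deployment.
Problem solved.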
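
The force-delete step can be sketched as a small helper. This is an illustration under assumptions, not the exact commands from the post: the function name force_delete_pods and the KUBECTL override variable are hypothetical, and the explicit --namespace flag is added here because the Pod lives outside the default namespace. The Deployment name and the redeploy method are not given in the post, so they appear only as placeholders.

```shell
#!/bin/sh
# Hypothetical helper (not from the original post).
# KUBECTL can be overridden, for example with "echo" for a dry run.
KUBECTL=${KUBECTL:-kubectl}

# Reads "namespace/name" lines on stdin and force-deletes each Pod.
# Force deletion removes the Pod object without waiting for the
# kubelet to confirm termination, so use it only on Pods that are
# already known to be stuck.
force_delete_pods() {
    while IFS=/ read -r ns name; do
        # Skip blank or malformed lines.
        [ -n "$name" ] || continue
        "$KUBECTL" delete pod "$name" --namespace "$ns" \
            --force --grace-period=0
    done
}

# Order of operations described in the post. Values in <...> are
# placeholders, and `kubectl apply` is only one assumed way to redeploy:
#   kubectl delete deployment <deployment> --namespace <namespace>
#   echo "<namespace>/<pod>" | force_delete_pods
#   kubectl apply -f <manifest.yaml>
```

Running with KUBECTL=echo first prints the commands that would be issued, which is a cheap safety check before force-deleting anything.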