Background

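Because the k8s cluster was running short of resources, some of the servers were shut down, given more capacity, and started again. After the restart, some pods never came back up.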

Troubleshooting

The pod logs showed that resolving in-cluster domain names was failing, which pointed to a DNS problem.

Listing the DNS-related pods showed that some nodelocaldns pods were restarting continuously.

root@master1:~# k get pods --all-namespaces | grep dns
kube-system                    coredns-74d59cc5c6-9brnf                            1/1     Running                 0          15d
kube-system                    coredns-74d59cc5c6-g46rf                            1/1     Running                 0          15d
kube-system                    nodelocaldns-bwnml                                  1/1     Running                 0          15d
kube-system                    nodelocaldns-f8tmj                                  0/1     CrashLoopBackOff        2926       12d
kube-system                    nodelocaldns-rtngg                                  0/1     CrashLoopBackOff        44         15d
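With more DNS pods than shown here, the crashing ones are easy to miss by eye. The following is a minimal sketch of a filter for that listing. The helper name unhealthy_pods is hypothetical, and it assumes the default column layout of `kubectl get pods --all-namespaces` (NAMESPACE, NAME, READY, STATUS, ...).

```shell
# Read `kubectl get pods --all-namespaces` output on stdin and print
# namespace/name for every pod whose STATUS column is not "Running".
# unhealthy_pods is a hypothetical helper name, not part of kubectl.
# Note: pods in other non-Running states (e.g. Completed) are listed too.
unhealthy_pods() {
    awk '$1 != "NAMESPACE" && NF >= 4 && $4 != "Running" { print $1 "/" $2 }'
}

# Usage on a node where kubectl is configured (an assumption):
#   kubectl get pods --all-namespaces | grep dns | unhealthy_pods
```

For a pod in CrashLoopBackOff, `kubectl logs -n kube-system <pod> --previous` should show the output of the last crashed container, which is usually where a bind error on port 53 would appear.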

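Log in to the affected node and check what is using port 53 with netstat -ntple | grep 53. The port turned out to be held by the named process, which explains why nodelocaldns, which also needs to listen on port 53 on the node, could not start.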
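Two caveats about that check: `-t` lists TCP sockets only, while DNS also uses UDP, and `grep 53` matches any line containing "53" (port 5353, a PID, an inode number). A tighter variant, sketched under the assumption that the net-tools netstat column layout is in use (local address in the fourth column, PID/program name last), could look like this. The helper name port53_listeners is hypothetical.

```shell
# Read `netstat -ntulpe` output on stdin and print protocol, local address
# and PID/program for sockets whose local port is exactly 53.
# port53_listeners is a hypothetical helper name.
port53_listeners() {
    awk '$4 ~ /:53$/ { print $1, $4, $NF }'
}

# Usage as root on the node, so that the PID/program column is filled in:
#   netstat -ntulpe | port53_listeners
```

The `-u` flag adds UDP sockets to the listing, and matching on `:53` at the end of the address field avoids false hits.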

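Kill the named process. Searching /lib/systemd/system shows that named is started by the bind9 service, which is enabled at boot. Disabling that service completes the fix.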

root@node1:/lib/systemd/system# grep -rn named ./
./bind9-pkcs11.service:3:Documentation=man:named(8)
./bind9-pkcs11.service:8:Environment=KRB5_KTNAME=/etc/bind/named.keytab
./bind9-pkcs11.service:10:ExecStart=/usr/sbin/named-pkcs11 -f -u bind
./bind9-resolvconf.service:3:Documentation=man:named(8) man:resolvconf(8)
./bind9-resolvconf.service:11:ExecStart=/bin/sh -c 'echo nameserver 127.0.0.1 | /sbin/resolvconf -a lo.named'
./bind9-resolvconf.service:12:ExecStop=/sbin/resolvconf -d lo.named
./systemd-hostnamed.service:12:Documentation=man:systemd-hostnamed.service(8) man:hostname(5) man:machine-info(5)
./systemd-hostnamed.service:13:Documentation=https://www.freedesktop.org/wiki/Software/systemd/hostnamed
./systemd-hostnamed.service:16:ExecStart=/lib/systemd/systemd-hostnamed
./bind9.service:3:Documentation=man:named(8)
./bind9.service:10:ExecStart=/usr/sbin/named -f $OPTIONS
root@node1:/lib/systemd/system# systemctl disable bind9
Synchronizing state of bind9.service with SysV service script with /lib/systemd/systemd-sysv-install.
Executing: /lib/systemd/systemd-sysv-install disable bind9
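The plain `grep -rn named` above also matches documentation lines and systemd-hostnamed. A narrower search, shown here only as a sketch, keeps just the unit files whose ExecStart line launches a binary called exactly named. The helper name units_starting_named is hypothetical, and the default directory is the one used in this post.

```shell
# Print the .service files in a directory whose ExecStart line runs a
# binary named exactly "named" (not named-pkcs11, not systemd-hostnamed).
# units_starting_named is a hypothetical helper name.
units_starting_named() {
    dir="${1:-/lib/systemd/system}"
    grep -lE '^ExecStart=.*/named( |$)' "$dir"/*.service 2>/dev/null
}

# Usage on the node:
#   units_starting_named
```

Once the unit is identified, `systemctl disable --now bind9` stops the running service and disables it at boot in one step, which would replace the separate kill and disable used above.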