乐趣区

关于kubernetes:k8s环境中监控不通问题排查思路

k8s 中的监控原理、prometheus 采集原理 能够看这个文章

k8s 监控指标汇总,prometheus 采集 k8s 原理解析

k8s-mon 我的项目介绍

我的项目地址

  • https://github.com/n9e/k8s-mon

    视频介绍

  • https://www.bilibili.com/vide…

kube-stats-metrics 没数据

排查思路 dns 问题

首先察看 k8s-mon-deployment 的日志


kubectl logs  -l app=k8s-mon-deployment  -n kube-admin |grep 8080

# 如有下列报错阐明网络不通
# err="Get \"http://kube-state-metrics.kube-system:8080/metrics\": net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)"

排查 dns,在 node 上申请 coredns 服务



root@k8s-local-test-01:/etc/kubernetes/manifests$ kubectl get svc  -n kube-system |grep dns
kube-dns             ClusterIP   10.96.0.10     <none>        53/UDP,53/TCP,9153/TCP   73d

在 node 上申请 coredns 解析 kube-stats-metrics 域名

# 10.96.0.10  为申请到的 coredns svc 地址 
# 因为 node 上的搜寻域没有 svc.cluster.local,所以须要 FQDN
dig kube-state-metrics.kube-system.svc.cluster.local @10.96.0.10 
# 如果失常的话则会有如下 A 记录 

root@k8s-local-test-01:~$ dig kube-state-metrics.kube-system.svc.cluster.local @10.96.0.10

; <<>> DiG 9.11.4-P2-RedHat-9.11.4-26.P2.el7_9.3 <<>> kube-state-metrics.kube-system.svc.cluster.local @10.96.0.10
;; global options: +cmd
;; Got answer:
;; WARNING: .local is reserved for Multicast DNS
;; You are currently testing what happens when an mDNS query is leaked to DNS
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 12799
;; flags: qr aa rd; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1
;; WARNING: recursion requested but not available

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
;; QUESTION SECTION:
;kube-state-metrics.kube-system.svc.cluster.local. IN A

;; ANSWER SECTION:
kube-state-metrics.kube-system.svc.cluster.local. 25 IN    A 10.100.30.129

;; Query time: 0 msec
;; SERVER: 10.96.0.10#53(10.96.0.10)
;; WHEN: Fri Apr 02 15:14:46 CST 2021
;; MSG SIZE  rcvd: 141

在 node 上申请 kube-stats-metrics

root@k8s-local-test-01:/etc/kubernetes/manifests$ curl -s 10.100.30.129:8080/metrics |head 
# HELP kube_certificatesigningrequest_labels Kubernetes labels converted to Prometheus labels.
# TYPE kube_certificatesigningrequest_labels gauge
# HELP kube_certificatesigningrequest_created Unix creation timestamp
# TYPE kube_certificatesigningrequest_created gauge
# HELP kube_certificatesigningrequest_condition The number of each certificatesigningrequest condition
# TYPE kube_certificatesigningrequest_condition gauge
# HELP kube_certificatesigningrequest_cert_length Length of the issued cert
# TYPE kube_certificatesigningrequest_cert_length gauge
# HELP kube_configmap_info Information about configmap.
# TYPE kube_configmap_info gauge

如果有输入证实 node 上申请 coredns 没问题,申请 ksm 服务也没问题

而后进入 k8s-mon-deployment 容器中查看

# 进入容器命令
kubectl -n kube-admin  exec "$(kubectl -nkube-admin get pod -l app=k8s-mon-deployment -o jsonpath='{.items[0].metadata.name}')"  -ti -- /bin/sh 

# ping 一下 kube-state-metrics.kube-system
PING kube-state-metrics.kube-system (10.100.30.129): 56 data bytes
64 bytes from 10.100.30.129: seq=0 ttl=64 time=0.097 ms
64 bytes from 10.100.30.129: seq=1 ttl=64 time=0.093 ms
64 bytes from 10.100.30.129: seq=2 ttl=64 time=0.114 ms
64 bytes from 10.100.30.129: seq=3 ttl=64 time=0.124 ms

# wget 申请一下 ksm 服务 

/ # wget http://kube-state-metrics.kube-system.svc.cluster.local:8080/metrics -O m |head  m 
# HELP kube_certificatesigningrequest_labels Kubernetes labels converted to Prometheus labels.
# TYPE kube_certificatesigningrequest_labels gauge
# HELP kube_certificatesigningrequest_created Unix creation timestamp
# TYPE kube_certificatesigningrequest_created gauge
# HELP kube_certificatesigningrequest_condition The number of each certificatesigningrequest condition
# TYPE kube_certificatesigningrequest_condition gauge
# HELP kube_certificatesigningrequest_cert_length Length of the issued cert
# TYPE kube_certificatesigningrequest_cert_length gauge
# HELP kube_configmap_info Information about configmap.
# TYPE kube_configmap_info gauge
Connecting to kube-state-metrics.kube-system.svc.cluster.local:8080 (10.100.30.129:8080)

如果在 node 上能够获取到,但在 pod 中获取不到思考 coredns 有问题或者容器网络 有问题

打印 coredns 日志

oot@k8s-local-test-01:/etc/kubernetes/manifests$ kubectl logs  -l k8s-app=kube-dns  -n kube-system -f 
.:53
[INFO] plugin/reload: Running configuration MD5 = db32ca3650231d74073ff4cf814959a7
CoreDNS-1.7.0
linux/amd64, go1.14.4, f59c03d
.:53
[INFO] plugin/reload: Running configuration MD5 = db32ca3650231d74073ff4cf814959a7
CoreDNS-1.7.0
linux/amd64, go1.14.4, f59c03d
 
root@k8s-local-test-01:/etc/kubernetes/manifests$ kubectl logs  -l k8s-app=kube-dns  -n kube-system |grep -i error

容器网络问题 能够依照这个文档排查 https://juejin.cn/post/684490…

apiserver 等服务组件没数据

排查思路 先看看日志 报什么错

在 node 上手动 带 token 申请下 apiserver 的 metrics

TOKEN=$(kubectl -n kube-admin  get secret $(kubectl -n kube-admin  get serviceaccount k8s-mon -o jsonpath='{.secrets[0].name}') -o jsonpath='{.data.token}' | base64 --decode ) 
curl  https://localhost:6443/metrics --header "Authorization: Bearer $TOKEN" --insecure 

# 如果失常的话能够看到 metrics 数据 

服务组件没有部署在 pod 中的须要在 configMap 中给出地址 并设置 user_specified:true

  kube_scheduler:
    user_specified: true
    addrs:
      - "https://1.1.1.1:1234/metrics"
      - "https://2.2.2.2:1234/metrics"

日志中报 push 到夜莺 agent 的谬误

level=error ts=2021-04-02T14:44:21.560+08:00 caller=push.go:79 msg=HttpPostPushDataBuildNewHttpPostReqError2 funcName=api-server url=http://localhost:2080/api/collector/push err="Post \"http://localhost:2080/api/collector/push\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)"
 

能够将 k8s-mon 的日志改为 debug 查看下

  command:
    - /opt/app/k8s-mon
    - --config.file=/etc/k8s-mon/k8s-mon.yml
    - --log.level=debug

debug 日志会打印每个 阶段的耗时

level=debug ts=2021-04-02T16:19:20.723+08:00 caller=kube_state_metrics.go:244 msg=DoCollectSuccessfullyReadyToPush funcName=kube-stats-metrics metrics_num=3551 time_took_seconds=0.276154232 metric_addr=http://kube-state-metrics.kube-system:8080/metrics
level=debug ts=2021-04-02T16:19:20.733+08:00 caller=kube_controller_manager.go:180 msg=DoCollectSuccessfullyReadyToPush funcName=kube-controller-manager metrics_num=642 time_took_seconds=0.286183625
level=debug ts=2021-04-02T16:19:20.845+08:00 caller=push.go:25 msg=PushWorkSuccess funcName=kube-controller-manager url=http://localhost:2080/api/collector/push metricsNum=642 time_took_seconds=0.111731185
level=debug ts=2021-04-02T16:19:20.935+08:00 caller=push.go:25 msg=PushWorkSuccess funcName=kube-stats-metrics url=http://localhost:2080/api/collector/push metricsNum=3551 time_took_seconds=0.212283608
level=debug ts=2021-04-02T16:19:21.459+08:00 caller=kube_apiserver.go:357 msg=DoCollectSuccessfullyReadyToPush funcName=api-server metrics_num=2168 time_took_seconds=1.012191635
level=debug ts=2021-04-02T16:19:21.639+08:00 caller=push.go:25 msg=PushWorkSuccess funcName=api-server url=http://localhost:2080/api/collector/push metricsNum=2168 time_took_seconds=0.179650444

能够到 node 下面 手动推一条数据给夜莺的 agent 试试

curl -X POST -H 'Accept: */*' -H 'Accept-Encoding: gzip, deflate' -H 'Connection: keep-alive' -H 'Content-Length: 183' -H 'Content-Type: application/json' -H 'User-Agent: python-requests/2.6.0 CPython/2.7.5 Linux/3.10.0-1160.11.1.el7.x86_64' -d '[{"tagsMap": {"k1":"v1"},"step": 15,"endpoint":"1","value": 1,"tags":"k1=v1","timestamp": 1617346924,"metric":"abc_test","extra":"", "nid": "1", "counterType": "COUNTER"}]' http://localhost:2080/api/collector/push 

localhost 还是 127.0.0.1 问题?

  • 在容器外部 ping localhost 看看
退出移动版