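For the monitoring principles in k8s and how Prometheus scrapes it, see this article:
Summary of k8s monitoring metrics and an analysis of how Prometheus scrapes k8s
k8s-mon: about my project
Project repository
- https://github.com/n9e/k8s-mon
Video introduction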
- https://www.bilibili.com/vide…
kube-state-metrics has no data
Troubleshooting approach: DNS problems
First, check the k8s-mon-deployment logs
kubectl logs -l app=k8s-mon-deployment -n kube-admin |grep 8080
# An error like the following means the network is unreachable
# err="Get \"http://kube-state-metrics.kube-system:8080/metrics\": net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)"
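When the log is long, it can help to bucket the errors before reading them one by one. The following is a minimal sketch, not part of k8s-mon: `classify_scrape_error` is a hypothetical helper name, and the patterns are the usual Go network error strings (timeout, `no such host`, `connection refused`), so treat the categories as a rough first guess.

```shell
# classify_scrape_error: hypothetical helper, not part of k8s-mon.
# Reads log lines on stdin and prints one category for every line that
# carries an err= field; lines without an error are skipped.
classify_scrape_error() {
  while IFS= read -r line; do
    case "$line" in
      *"no such host"*)            echo "dns" ;;      # name did not resolve
      *"connection refused"*)      echo "refused" ;;  # nothing listening on the port
      *"Client.Timeout exceeded"*) echo "timeout" ;;  # no answer in time
      *"err="*)                    echo "other" ;;
    esac
  done
}

# Usage (assumes the same labels and namespace as the command above):
#   kubectl logs -l app=k8s-mon-deployment -n kube-admin | classify_scrape_error | sort | uniq -c
```

A count dominated by `timeout` points at connectivity, while `dns` points at name resolution, which is what the following steps check.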
Check DNS: on the node, look up the coredns service
root@k8s-local-test-01:/etc/kubernetes/manifests$ kubectl get svc -n kube-system |grep dns
kube-dns ClusterIP 10.96.0.10 <none> 53/UDP,53/TCP,9153/TCP 73d
On the node, ask coredns to resolve the kube-state-metrics domain name
# 10.96.0.10 is the coredns svc address found above
# The node's search domains do not include svc.cluster.local, so the FQDN is required
dig kube-state-metrics.kube-system.svc.cluster.local @10.96.0.10
# If everything is working, an A record like the following is returned
root@k8s-local-test-01:~$ dig kube-state-metrics.kube-system.svc.cluster.local @10.96.0.10
; <<>> DiG 9.11.4-P2-RedHat-9.11.4-26.P2.el7_9.3 <<>> kube-state-metrics.kube-system.svc.cluster.local @10.96.0.10
;; global options: +cmd
;; Got answer:
;; WARNING: .local is reserved for Multicast DNS
;; You are currently testing what happens when an mDNS query is leaked to DNS
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 12799
;; flags: qr aa rd; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1
;; WARNING: recursion requested but not available
;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
;; QUESTION SECTION:
;kube-state-metrics.kube-system.svc.cluster.local. IN A
;; ANSWER SECTION:
kube-state-metrics.kube-system.svc.cluster.local. 25 IN A 10.100.30.129
;; Query time: 0 msec
;; SERVER: 10.96.0.10#53(10.96.0.10)
;; WHEN: Fri Apr 02 15:14:46 CST 2021
;; MSG SIZE rcvd: 141
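Since the node cannot rely on search domains, the FQDN has to be spelled out each time. A small sketch of building it: `svc_fqdn` is a hypothetical helper name, and `cluster.local` is only the default cluster domain, so pass a third argument if your cluster uses a different one.

```shell
# svc_fqdn NAME NAMESPACE [CLUSTER_DOMAIN]
# Prints the fully qualified DNS name of a Service.
# CLUSTER_DOMAIN defaults to cluster.local (an assumption about the cluster).
svc_fqdn() {
  name=$1
  ns=$2
  domain=${3:-cluster.local}
  printf '%s.%s.svc.%s\n' "$name" "$ns" "$domain"
}

# Usage on a node (10.96.0.10 is the kube-dns ClusterIP from the output above):
#   dig +short "$(svc_fqdn kube-state-metrics kube-system)" @10.96.0.10
```

With `+short`, dig prints only the answer, so an empty result means there was no A record.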
On the node, request kube-state-metrics
root@k8s-local-test-01:/etc/kubernetes/manifests$ curl -s 10.100.30.129:8080/metrics |head
# HELP kube_certificatesigningrequest_labels Kubernetes labels converted to Prometheus labels.
# TYPE kube_certificatesigningrequest_labels gauge
# HELP kube_certificatesigningrequest_created Unix creation timestamp
# TYPE kube_certificatesigningrequest_created gauge
# HELP kube_certificatesigningrequest_condition The number of each certificatesigningrequest condition
# TYPE kube_certificatesigningrequest_condition gauge
# HELP kube_certificatesigningrequest_cert_length Length of the issued cert
# TYPE kube_certificatesigningrequest_cert_length gauge
# HELP kube_configmap_info Information about configmap.
# TYPE kube_configmap_info gauge
If there is output, the node can reach both coredns and the ksm service
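Instead of eyeballing the output of `head`, you can count how many metric families the endpoint exposes. This is a rough sketch: `count_metric_families` is a hypothetical helper that simply counts the `# TYPE` lines of the Prometheus text format.

```shell
# count_metric_families: hypothetical helper.
# Reads Prometheus text-format metrics on stdin and prints the number of
# "# TYPE" lines, i.e. the number of metric families exposed.
count_metric_families() {
  # grep -c exits non-zero when the count is 0; "|| true" keeps the exit status clean.
  grep -c '^# TYPE ' || true
}

# Usage on a node (the IP is the ksm ClusterIP resolved earlier):
#   curl -s 10.100.30.129:8080/metrics | count_metric_families
```

A count of 0 means the request returned nothing usable, even if curl itself did not fail.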
Next, enter the k8s-mon-deployment container and check from there
# Command to enter the container
kubectl -n kube-admin exec "$(kubectl -nkube-admin get pod -l app=k8s-mon-deployment -o jsonpath='{.items[0].metadata.name}')" -ti -- /bin/sh
# ping kube-state-metrics.kube-system
PING kube-state-metrics.kube-system (10.100.30.129): 56 data bytes
64 bytes from 10.100.30.129: seq=0 ttl=64 time=0.097 ms
64 bytes from 10.100.30.129: seq=1 ttl=64 time=0.093 ms
64 bytes from 10.100.30.129: seq=2 ttl=64 time=0.114 ms
64 bytes from 10.100.30.129: seq=3 ttl=64 time=0.124 ms
# Request the ksm service with wget
/ # wget http://kube-state-metrics.kube-system.svc.cluster.local:8080/metrics -O m |head m
# HELP kube_certificatesigningrequest_labels Kubernetes labels converted to Prometheus labels.
# TYPE kube_certificatesigningrequest_labels gauge
# HELP kube_certificatesigningrequest_created Unix creation timestamp
# TYPE kube_certificatesigningrequest_created gauge
# HELP kube_certificatesigningrequest_condition The number of each certificatesigningrequest condition
# TYPE kube_certificatesigningrequest_condition gauge
# HELP kube_certificatesigningrequest_cert_length Length of the issued cert
# TYPE kube_certificatesigningrequest_cert_length gauge
# HELP kube_configmap_info Information about configmap.
# TYPE kube_configmap_info gauge
Connecting to kube-state-metrics.kube-system.svc.cluster.local:8080 (10.100.30.129:8080)
If the metrics can be fetched on the node but not in the pod, suspect coredns or the container network
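The node-versus-pod comparison can be written down as a tiny decision table. This is only a sketch: `suspect_layer` is a hypothetical helper, and only the "node works, pod fails" branch comes from this article; the other two branches are my own assumptions about where to look next.

```shell
# suspect_layer NODE_OK POD_OK   (each argument is "yes" or "no")
# Prints where to look next, based on whether fetching the ksm metrics
# worked from the node and from inside the k8s-mon pod.
suspect_layer() {
  case "$1/$2" in
    yes/yes) echo "network path is fine; check k8s-mon config and logs" ;;      # assumption
    yes/no)  echo "suspect coredns or the container network" ;;                 # from the article
    no/yes|no/no) echo "suspect kube-state-metrics or node-level networking" ;; # assumption
    *)       echo "usage: suspect_layer yes|no yes|no" >&2; return 2 ;;
  esac
}

# Usage:
#   suspect_layer yes no
```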
Print the coredns logs
root@k8s-local-test-01:/etc/kubernetes/manifests$ kubectl logs -l k8s-app=kube-dns -n kube-system -f
.:53
[INFO] plugin/reload: Running configuration MD5 = db32ca3650231d74073ff4cf814959a7
CoreDNS-1.7.0
linux/amd64, go1.14.4, f59c03d
.:53
[INFO] plugin/reload: Running configuration MD5 = db32ca3650231d74073ff4cf814959a7
CoreDNS-1.7.0
linux/amd64, go1.14.4, f59c03d
root@k8s-local-test-01:/etc/kubernetes/manifests$ kubectl logs -l k8s-app=kube-dns -n kube-system |grep -i error
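For container network problems, you can troubleshoot by following this document: https://juejin.cn/post/684490…
apiserver and other service components have no data
Troubleshooting approach: first check the logs to see what error is reported
On the node, manually request the apiserver metrics with a token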
TOKEN=$(kubectl -n kube-admin get secret $(kubectl -n kube-admin get serviceaccount k8s-mon -o jsonpath='{.secrets[0].name}') -o jsonpath='{.data.token}' | base64 --decode )
curl https://localhost:6443/metrics --header "Authorization: Bearer $TOKEN" --insecure
# If everything is working, you will see the metrics data
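If the curl call returns an authorization error, it is worth checking that `TOKEN` was actually extracted before blaming RBAC. Service account tokens are JWTs, so a quick shape check catches an empty or mangled value. A minimal sketch: `looks_like_jwt` is a hypothetical helper and it only checks the shape (three non-empty dot-separated parts), not the signature or the permissions.

```shell
# looks_like_jwt TOKEN
# Succeeds if TOKEN has exactly three non-empty, dot-separated parts.
# This is a shape check only; it does not validate the token.
looks_like_jwt() {
  case "$1" in
    *.*.*.*)  return 1 ;;  # three or more dots: too many parts
    ?*.?*.?*) return 0 ;;  # header.payload.signature
    *)        return 1 ;;  # empty or too few parts
  esac
}

# Usage, after setting TOKEN as above:
#   looks_like_jwt "$TOKEN" || echo "TOKEN is empty or not a JWT" >&2
```

An empty `TOKEN` can happen when the service account has no entry under `.secrets`, for example on newer Kubernetes versions that no longer create token secrets automatically.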
Service components that are not deployed in pods need their addresses listed in the configMap, with user_specified: true set
kube_scheduler:
  user_specified: true
  addrs:
  - "https://1.1.1.1:1234/metrics"
  - "https://2.2.2.2:1234/metrics"
The logs report an error pushing to the Nightingale (夜莺) agent
level=error ts=2021-04-02T14:44:21.560+08:00 caller=push.go:79 msg=HttpPostPushDataBuildNewHttpPostReqError2 funcName=api-server url=http://localhost:2080/api/collector/push err="Post \"http://localhost:2080/api/collector/push\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)"
You can switch the k8s-mon log level to debug to investigate
command:
- /opt/app/k8s-mon
- --config.file=/etc/k8s-mon/k8s-mon.yml
- --log.level=debug
The debug log prints the time taken by each stage
level=debug ts=2021-04-02T16:19:20.723+08:00 caller=kube_state_metrics.go:244 msg=DoCollectSuccessfullyReadyToPush funcName=kube-stats-metrics metrics_num=3551 time_took_seconds=0.276154232 metric_addr=http://kube-state-metrics.kube-system:8080/metrics
level=debug ts=2021-04-02T16:19:20.733+08:00 caller=kube_controller_manager.go:180 msg=DoCollectSuccessfullyReadyToPush funcName=kube-controller-manager metrics_num=642 time_took_seconds=0.286183625
level=debug ts=2021-04-02T16:19:20.845+08:00 caller=push.go:25 msg=PushWorkSuccess funcName=kube-controller-manager url=http://localhost:2080/api/collector/push metricsNum=642 time_took_seconds=0.111731185
level=debug ts=2021-04-02T16:19:20.935+08:00 caller=push.go:25 msg=PushWorkSuccess funcName=kube-stats-metrics url=http://localhost:2080/api/collector/push metricsNum=3551 time_took_seconds=0.212283608
level=debug ts=2021-04-02T16:19:21.459+08:00 caller=kube_apiserver.go:357 msg=DoCollectSuccessfullyReadyToPush funcName=api-server metrics_num=2168 time_took_seconds=1.012191635
level=debug ts=2021-04-02T16:19:21.639+08:00 caller=push.go:25 msg=PushWorkSuccess funcName=api-server url=http://localhost:2080/api/collector/push metricsNum=2168 time_took_seconds=0.179650444
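To see which stage is slow without reading every line, the `time_took_seconds` field can be pulled out and sorted. A minimal sketch, assuming the `key=value` layout shown in the lines above; `stage_timings` is a hypothetical helper, not something k8s-mon ships.

```shell
# stage_timings: hypothetical helper.
# Reads k8s-mon debug log lines on stdin and prints "seconds msg funcName"
# for every line that has a time_took_seconds field, slowest first.
stage_timings() {
  awk '
    /time_took_seconds=/ {
      msg = ""; fn = ""; secs = ""
      for (i = 1; i <= NF; i++) {
        if ($i ~ /^msg=/)               msg  = substr($i, 5)
        if ($i ~ /^funcName=/)          fn   = substr($i, 10)
        if ($i ~ /^time_took_seconds=/) secs = substr($i, 19)
      }
      print secs, msg, fn
    }
  ' | sort -rn
}

# Usage (assumes the same labels and namespace as earlier):
#   kubectl logs -l app=k8s-mon-deployment -n kube-admin | stage_timings | head
```

If the `PushWorkSuccess` entries are the slow ones, the bottleneck is the push to the agent rather than the collection.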
You can go to the node and manually push one data point to the Nightingale agent to test it
curl -X POST -H 'Accept: */*' -H 'Accept-Encoding: gzip, deflate' -H 'Connection: keep-alive' -H 'Content-Type: application/json' -H 'User-Agent: python-requests/2.6.0 CPython/2.7.5 Linux/3.10.0-1160.11.1.el7.x86_64' -d '[{"tagsMap": {"k1":"v1"},"step": 15,"endpoint":"1","value": 1,"tags":"k1=v1","timestamp": 1617346924,"metric":"abc_test","extra":"", "nid": "1", "counterType": "COUNTER"}]' http://localhost:2080/api/collector/push
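The test point above carries a fixed timestamp. A small sketch that builds the same payload with the current time: `make_push_payload` is a hypothetical helper, and the field names and fixed values (`endpoint`, `nid`, `tags`, `step`, `counterType`) are copied from the curl command in this article rather than from any API reference.

```shell
# make_push_payload METRIC VALUE [TIMESTAMP]
# Prints a one-element JSON array in the shape used by the curl test above.
# TIMESTAMP defaults to the current Unix time.
make_push_payload() {
  metric=$1
  value=$2
  ts=${3:-$(date +%s)}
  printf '[{"metric":"%s","endpoint":"1","nid":"1","tags":"k1=v1","tagsMap":{"k1":"v1"},"step":15,"value":%s,"timestamp":%s,"counterType":"COUNTER","extra":""}]\n' \
    "$metric" "$value" "$ts"
}

# Usage on the node (curl reads the body from stdin with -d @-):
#   make_push_payload abc_test 1 | curl -s -X POST -H 'Content-Type: application/json' -d @- http://localhost:2080/api/collector/push
```

Letting curl compute the request length itself avoids a mismatch between a hand-written `Content-Length` header and the actual body.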
Is it a localhost vs. 127.0.0.1 problem?
- ping localhost inside the container to check
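Besides ping, you can look at what the hosts file maps `localhost` to. This is a sketch under an assumption: one possible cause of the push timeout is that `localhost` resolves to an address the agent is not listening on (for example `::1`), in which case using `127.0.0.1` in the push URL would behave differently. `localhost_addrs` is a hypothetical helper.

```shell
# localhost_addrs: hypothetical helper.
# Reads a hosts-format file on stdin and prints every address that has
# "localhost" among its names.
localhost_addrs() {
  awk '
    { sub(/#.*/, "") }   # strip trailing comments
    {
      for (i = 2; i <= NF; i++)
        if ($i == "localhost") { print $1; break }
    }
  '
}

# Usage inside the container:
#   localhost_addrs < /etc/hosts
```

If more than one address is printed, try the push URL with `127.0.0.1` explicitly and compare.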