About Prometheus: Prometheus Operator cloud-native monitoring, an analysis of the internal links between operator-deployed resources

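What this post covers

This post assumes you have already deployed kube-prometheus.

Suppose there is a requirement: the node-exporter metrics need to be exposed outside the k8s cluster. To understand this problem and meet the requirement, you need a reasonable grasp of the resources deployed through the operator and of how they link together internally. That is what this post shares.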

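About the manifests directory

This post also assumes you have some understanding of, and hands-on experience with, prometheus-operator and kube-prometheus.

The manifests directory of the kube-prometheus repository already contains the Kubernetes manifests for installing and deploying Prometheus Operator, Alertmanager, Node Exporter, kube-state-metrics, Grafana and other components. These manifests can be used to deploy the components into a Kubernetes cluster so that you can start monitoring the cluster's metrics. The directory also contains other resource manifests such as Service, ConfigMap, Role and ClusterRole. Together, these files provide a complete Prometheus monitoring solution.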

Why the metrics cannot be fetched from outside

Look at how it is deployed

nodeExporter-daemonset.yaml

    apiVersion: apps/v1
    kind: DaemonSet
    ...

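In k8s, a DaemonSet is a controller for deploying daemons across the cluster. It ensures that every node runs one copy of a Pod, which makes it easy to run a daemon cluster-wide. It works by automatically creating a Pod on each node, and those Pods keep running until the DaemonSet is deleted or updated. When a new node joins the cluster, the DaemonSet automatically creates a Pod on it. Conversely, when a node is removed, the corresponding Pod is removed automatically. DaemonSets are commonly used for system-level services such as monitoring agents and log collection agents, which need to run on every node. Deploying node-exporter with a DaemonSet is therefore a very good fit.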
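The behaviour described above can be pictured with a small Python sketch. This is only a simplified model of the idea, not the real controller code: it ignores taints, tolerations and node selectors, and the function name `reconcile_daemonset` is made up for illustration.

```python
def reconcile_daemonset(nodes, pods_by_node):
    """Return (nodes needing a new Pod, nodes whose Pod should be removed).

    nodes: set of node names currently in the cluster.
    pods_by_node: dict of node name -> Pod name for Pods owned by the DaemonSet.
    """
    # A node without a Pod gets one created.
    to_create = sorted(n for n in nodes if n not in pods_by_node)
    # A Pod whose node has left the cluster is cleaned up.
    to_delete = sorted(n for n in pods_by_node if n not in nodes)
    return to_create, to_delete


if __name__ == "__main__":
    nodes = {"k8s-a-master", "k8s-a-node02", "k8s-a-node03"}
    pods = {"k8s-a-master": "node-exporter-gcrx4",
            "k8s-a-node09": "node-exporter-2wgf7"}
    print(reconcile_daemonset(nodes, pods))
```

In this toy run, node02 and node03 still need a Pod, while the Pod recorded for node09 (which is no longer in the node set) would be removed.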

Check the Pods created automatically on the nodes

    [root@k8s-a-master manifests-prometheus-operator]# kubectl get pod -n monitoring -o wide | grep node-exporter
    node-exporter-2wgf7                    2/2     Running   0             3m14s   192.168.11.19    k8s-a-node09   <none>           <none>
    node-exporter-65fb9                    2/2     Running   0             3m14s   192.168.11.20    k8s-a-node10   <none>           <none>
    node-exporter-6p2ll                    2/2     Running   0             3m14s   192.168.11.16    k8s-a-node06   <none>           <none>
    node-exporter-9jnml                    2/2     Running   0             3m14s   192.168.11.12    k8s-a-node02   <none>           <none>
    node-exporter-cjsr6                    2/2     Running   0             3m14s   192.168.11.17    k8s-a-node07   <none>           <none>
    node-exporter-d9lqf                    2/2     Running   0             3m14s   192.168.11.13    k8s-a-node03   <none>           <none>
    node-exporter-gcrx4                    2/2     Running   0             3m14s   192.168.11.10    k8s-a-master   <none>           <none>

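Because node-exporter is deployed with a DaemonSet, every node runs one instance of it. Each node therefore has a node-exporter instance listening on port 9100, and all of these instances provide monitoring data to Prometheus, so that Prometheus can manage and analyze the data centrally. Next, let's look at the listening port.

Let's look at the ports section

    ...
    ports:
    - containerPort: 9100
      hostPort: 9100
      name: https
    ...

"name" gives the port the name "https". "containerPort" is the port number of the container in the Pod, 9100. "hostPort" is the port number on the host node, also 9100. This means that on any host node, the container in the Pod can be reached through port 9100.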

On the targets page, the node-exporter job has been added automatically

Internally, the https protocol is used.

From outside, a browser cannot access it directly

![image]()

Let's try curl from inside a cluster node

    [root@k8s-a-node06 ~]# curl https://192.168.11.16:9100/metrics
    curl: (60) Peer's certificate issuer has been marked as not trusted by the user.
    More details here: http://curl.haxx.se/docs/sslcerts.html
    curl performs SSL certificate verification by default, using a "bundle" of Certificate Authority (CA) public keys (CA certs). If the default bundle file isn't adequate, you can specify an alternate file using the --cacert option.
    If this HTTPS server uses a certificate signed by a CA represented in the bundle, the certificate verification probably failed due to a problem with the certificate (it might be expired, or the name might not match the domain name in the URL).
    If you'd like to turn off curl's verification of the certificate, use the -k (or --insecure) option.
    [root@k8s-a-node06 ~]# curl http://192.168.11.16:9100/metrics
    Client sent an HTTP request to an HTTPS server.
    [root@k8s-a-node06 ~]#
    [root@k8s-a-node06 ~]# curl http://127.0.0.1:9100/metrics
    # HELP go_gc_duration_seconds A summary of the pause duration of garbage collection cycles.
    # TYPE go_gc_duration_seconds summary
    go_gc_duration_seconds{quantile="0"} 2.2247e-05
    go_gc_duration_seconds{quantile="0.25"} 3.0408e-05
    go_gc_duration_seconds{quantile="0.5"} 3.2917e-05
    ...
    process_cpu_seconds_total 2.77
    # HELP process_max_fds Maximum number of open file descriptors.
    # TYPE process_max_fds gauge
    process_max_fds 1.048576e+06
    # HELP process_open_fds Number of open file descriptors.
    # TYPE process_open_fds gauge
    process_open_fds 10
    # HELP process_resident_memory_bytes Resident memory size in bytes.
    # TYPE process_resident_memory_bytes gauge
    ...

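So from inside the cluster, accessing 127.0.0.1 over http does return the metrics, with no complaint about certificates. Does that mean that simply making it use http would let us fetch the metrics from outside? Let's keep going.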
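The curl error above suggests `-k` to skip certificate verification. For readers who prefer a script, here is a rough Python equivalent using only the standard library. It is a sketch under assumptions: the helper names are made up, and the optional `token` parameter is there only in case the HTTPS endpoint also expects a bearer token, which this post has not verified.

```python
import ssl
import urllib.request


def insecure_context():
    """TLS context that skips certificate checks, similar to `curl -k`."""
    ctx = ssl.create_default_context()
    ctx.check_hostname = False       # must be turned off before relaxing verify_mode
    ctx.verify_mode = ssl.CERT_NONE  # accept untrusted or self-signed certificates
    return ctx


def build_request(url, token=None):
    """Build a GET request; `token` is an optional bearer token (assumption)."""
    req = urllib.request.Request(url)
    if token:
        req.add_header("Authorization", f"Bearer {token}")
    return req


def fetch_metrics(url, token=None, timeout=5):
    """Fetch the metrics text from `url` without verifying the certificate."""
    req = build_request(url, token)
    with urllib.request.urlopen(req, context=insecure_context(), timeout=timeout) as resp:
        return resp.read().decode("utf-8")
```

A call such as `fetch_metrics("https://192.168.11.16:9100/metrics")` would mirror the first curl command, minus the certificate check. Skipping verification is fine for a quick test but not something to rely on in production.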

The nodeExporter yaml manifests

First, understand what each yaml is for

    [root@k8s-a-master manifests-prometheus-operator]# ls -l nodeExporter-*
    -rw-r--r-- 1 root root   468 Apr 11 10:30 nodeExporter-clusterRoleBinding.yaml
    -rw-r--r-- 1 root root   485 Apr 11 10:30 nodeExporter-clusterRole.yaml
    -rw-r--r-- 1 root root  3640 Apr 27 21:46 nodeExporter-daemonset.yaml
    -rw-r--r-- 1 root root   671 Apr 11 10:30 nodeExporter-networkPolicy.yaml
    -rw-r--r-- 1 root root 15214 Apr 11 10:30 nodeExporter-prometheusRule.yaml
    -rw-r--r-- 1 root root   306 Apr 11 10:30 nodeExporter-serviceAccount.yaml
    -rw-r--r-- 1 root root   850 Apr 27 21:46 nodeExporter-serviceMonitor.yaml
    -rw-r--r-- 1 root root   492 Apr 27 21:46 nodeExporter-service.yaml

nodeExporter-clusterRoleBinding.yaml: defines a ClusterRoleBinding object that authorizes a specific service account (defined in nodeExporter-serviceAccount.yaml) to access the resources related to Node Exporter.
nodeExporter-clusterRole.yaml: defines a ClusterRole object that grants a set of permissions, allowing Prometheus Server to access the Node Exporter metrics.
nodeExporter-daemonset.yaml: defines a DaemonSet object that runs one copy of Node Exporter on every node of the Kubernetes cluster, so that metrics are collected from each node.
nodeExporter-networkPolicy.yaml: defines a NetworkPolicy object that restricts network traffic to Node Exporter, so that only traffic from Prometheus Server can reach it.
nodeExporter-prometheusRule.yaml: defines a set of PrometheusRule objects that evaluate the Node Exporter metrics and raise the corresponding alerts. The rules specify which metrics to check and their thresholds.
nodeExporter-serviceAccount.yaml: defines a ServiceAccount object that authorizes Node Exporter to access the Kubernetes API. This service account is bound to the ClusterRole mentioned above.
nodeExporter-serviceMonitor.yaml: defines a ServiceMonitor object that tells Prometheus Server how to collect metrics from Node Exporter. It specifies, among other things, which Service and which port to scrape.
nodeExporter-service.yaml: defines a Service object that exposes the Node Exporter network service inside the Kubernetes cluster. This Service is monitored by the ServiceMonitor defined in nodeExporter-serviceMonitor.yaml.

The manifests for the other components, such as grafana, alertmanager, blackboxExporter and kubeStateMetrics, follow the same pattern.

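Analyzing the https chain

We already know that https is used internally. Now that the role of each yaml in the manifests is clear, you probably have a logical diagram like the following in mind:

Next, we track down the https-related configuration layer by layer.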

Analyzing nodeExporter-daemonset

nodeExporter-daemonset.yaml contains the following two related pieces of configuration:

    ...
    containers:
    - args:
      - --web.listen-address=127.0.0.1:9100
    ...
    ports:
    - containerPort: 9100
      hostPort: 9100
      name: https
    ...

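--web.listen-address=127.0.0.1:9100: an argument that tells the process to listen for incoming requests on port 9100 of 127.0.0.1.
containerPort: 9100: the port number inside the container. When the container starts, it listens for incoming traffic on this port.
hostPort: 9100: the port number on the host. When the container starts, it is bound to this port on the host, so other processes on the host can reach the application running in the container through it.
name: https: the name of the port. It is an optional field, but it is often useful because it lets you reference the port elsewhere by name instead of hard-coding the port number.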
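The difference between listening on 127.0.0.1 and on 0.0.0.0 can be captured in a few lines of Python. This is a deliberately simplified model rather than how the kernel decides: `accepts_connection` is a hypothetical helper, and it assumes `dest_ip` is one of the node's own addresses.

```python
import ipaddress


def accepts_connection(listen_host, dest_ip):
    """Return True if a socket bound to listen_host receives traffic sent to dest_ip."""
    listen = ipaddress.ip_address(listen_host)
    dest = ipaddress.ip_address(dest_ip)
    if listen.is_unspecified:
        # 0.0.0.0 means "every local interface", including the node IP.
        return True
    # Otherwise only traffic addressed to exactly that address is accepted.
    return listen == dest


if __name__ == "__main__":
    # Bound to loopback: reachable on the node itself, not via the node IP.
    print(accepts_connection("127.0.0.1", "127.0.0.1"))
    print(accepts_connection("127.0.0.1", "192.168.11.16"))
    # Bound to all interfaces: reachable via the node IP as well.
    print(accepts_connection("0.0.0.0", "192.168.11.16"))
```

Under this model, a listener on 127.0.0.1:9100 answers only local requests, which matches the curl test against 127.0.0.1 earlier in the post.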

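Tested:

The port name is just a name and can be changed to another valid name. For example, I changed it to "http":

    name: http

Change 127.0.0.1 to 0.0.0.0. After this change, the metrics can be fetched from outside k8s with a browser over http:

    --web.listen-address=0.0.0.0:9100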

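![image]()

Then something awkward happened. The page shows that the metrics can no longer be scraped:

![image]()

The reason is that the serviceMonitor still connects over https. That is odd: the previous step only changed the listen address to 0.0.0.0, so has https stopped working? Let's keep going.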

Analyzing nodeExporter-serviceMonitor and nodeExporter-service

First look at nodeExporter-serviceMonitor.yaml and find the relevant fields

    ...
    spec:
      endpoints:
      - bearerTokenFile: /var/run/secrets/kubernetes.io/serviceaccount/token
        interval: 15s
        port: https # look at this one
        relabelings:
        - action: replace
          regex: (.*)
          replacement: $1
          sourceLabels:
          - __meta_kubernetes_pod_node_name
          targetLabel: instance
        scheme: https # and this one
        tlsConfig:
          insecureSkipVerify: true
      jobLabel: app.kubernetes.io/name
      selector:
        matchLabels:
    ...

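The meaning of the port and scheme fields above can be looked up in the official API documentation:

https://prometheus-operator.dev/docs/operator/api/#monitoring...

So the answer is now clear: the scheme field is what really decides whether https or http is used. Following the logical diagram from earlier, port here refers to ports.name in nodeExporter-service.yaml.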
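To make the chain concrete, here is a small Python sketch that follows the names from the ServiceMonitor endpoint to the Service port and on to the container port, then builds the scrape URL. It is a simplified mental model, not how Prometheus actually discovers targets, and `resolve_scrape_url` is a hypothetical helper. The Service values in the usage part are assumed to mirror the original "https" naming, since the unmodified Service is not shown in this post.

```python
def resolve_scrape_url(endpoint, service, container_ports, pod_ip, path="/metrics"):
    """Follow ServiceMonitor endpoint -> Service port -> container port, return the URL."""
    # 1. The ServiceMonitor `port` refers to a Service port by name.
    svc_port = next((p for p in service["ports"] if p["name"] == endpoint["port"]), None)
    if svc_port is None:
        raise LookupError(f"Service has no port named {endpoint['port']!r}")

    # 2. The Service `targetPort` is either a number or a container port name.
    target = svc_port.get("targetPort", svc_port["port"])
    if isinstance(target, str):
        match = next((c for c in container_ports if c.get("name") == target), None)
        if match is None:
            raise LookupError(f"no container port named {target!r}")
        target = match["containerPort"]

    # 3. `scheme` alone decides between http and https; the port name does not.
    scheme = endpoint.get("scheme", "http")
    return f"{scheme}://{pod_ip}:{target}{path}"


if __name__ == "__main__":
    endpoint = {"port": "https", "scheme": "https"}
    service = {"ports": [{"name": "https", "port": 9100, "targetPort": "https"}]}  # assumed
    containers = [{"containerPort": 9100, "hostPort": 9100, "name": "https"}]
    print(resolve_scrape_url(endpoint, service, containers, "192.168.11.16"))
```

Note how the port names only serve to connect the three objects, while the protocol in the resulting URL comes from scheme.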

Change https to http:

    scheme: http

To keep the naming meaningful, change the port name as well:

    port: http

Then change the name and targetPort under ports in nodeExporter-service.yaml

    spec:
      clusterIP: None
      ports:
      - name: http
        port: 9100
        targetPort: http

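Below is a relationship diagram I put together:

Here is the final result:

It now uses http, and the metrics can be fetched directly from outside as well.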
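Once the endpoint is reachable over plain http, an external script can read the metrics text directly. As a rough illustration, the Python sketch below parses lines like the ones in the curl output earlier. It is intentionally minimal and rests on assumptions: `parse_metrics` is a made-up helper, and it does not handle timestamps, escaped quotes, or label values that contain commas or spaces.

```python
def parse_metrics(text):
    """Parse Prometheus text-format lines into (name, labels, value) tuples."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        series, _, value = line.rpartition(" ")
        name, _, rest = series.partition("{")
        labels = {}
        if rest:
            # rest looks like: quantile="0"}
            for pair in rest.rstrip("}").split(","):
                if pair:
                    key, _, val = pair.partition("=")
                    labels[key.strip()] = val.strip().strip('"')
        samples.append((name, labels, float(value)))
    return samples


if __name__ == "__main__":
    sample = """# TYPE go_gc_duration_seconds summary
go_gc_duration_seconds{quantile="0"} 2.2247e-05
process_cpu_seconds_total 2.77
process_max_fds 1.048576e+06
"""
    for name, labels, value in parse_metrics(sample):
        print(name, labels, value)
```

For anything beyond a quick check, a maintained client library is a safer choice than hand-rolled parsing.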

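Finally

Once you have walked through the whole chain, you will find that you understand Prometheus Operator much more deeply. For other resources, or for monitoring resources you define for your own business services, the pattern stays essentially the same.
This article is reposted from the WeChat official account 不背锅运维: https://mp.weixin.qq.com/s/dKVt29oO9pXQW7SEe5Hkow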