Previous article

Implementing alert notifications based on Kubernetes Events for IoT edge clusters

Goals

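  1. Alert recovery notifications - after evaluation, this cannot be implemented.

    1. Reason: an alert and its recovery are completely separate, unrelated events. Alerts are Warning-level events and recoveries are Normal-level events, so enabling recovery notifications would mean sending every Normal event, and the volume would be overwhelming. Moreover, without considerable experience and patience, it is impossible to tell which Normal event corresponds to the recovery of which alert.
  2. Keep alerting while the problem is not resolved - this is a built-in capability and needs no extra configuration.
  3. Show resource names in the alert content, such as the node name and the pod name.
  4. Allow specific nodes and workloads to be muted, with the mute list adjustable at any time.

    1. For example, node worker-1 in cluster 001 undergoes planned maintenance: alerting for it is suspended during the maintenance window and resumes once the maintenance is finished.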

Configuration

Showing resource names in the alert content

Some typical events:

apiVersion: v1
count: 101557
eventTime: null
firstTimestamp: "2022-04-08T03:50:47Z"
involvedObject:
  apiVersion: v1
  fieldPath: spec.containers{prometheus}
  kind: Pod
  name: prometheus-rancher-monitoring-prometheus-0
  namespace: cattle-monitoring-system
kind: Event
lastTimestamp: "2022-04-14T11:39:19Z"
message: 'Readiness probe failed: Get "http://10.42.0.87:9090/-/ready": context deadline
  exceeded (Client.Timeout exceeded while awaiting headers)'
metadata:
  creationTimestamp: "2022-04-08T03:51:17Z"
  name: prometheus-rancher-monitoring-prometheus-0.16e3cf53f0793344
  namespace: cattle-monitoring-system
reason: Unhealthy
reportingComponent: ""
reportingInstance: ""
source:
  component: kubelet
  host: master-1
type: Warning
---
apiVersion: v1
count: 116
eventTime: null
firstTimestamp: "2022-04-13T02:43:26Z"
involvedObject:
  apiVersion: v1
  fieldPath: spec.containers{grafana}
  kind: Pod
  name: rancher-monitoring-grafana-57777cc795-2b2x5
  namespace: cattle-monitoring-system
kind: Event
lastTimestamp: "2022-04-14T11:18:56Z"
message: 'Readiness probe failed: Get "http://10.42.0.90:3000/api/health": context
  deadline exceeded (Client.Timeout exceeded while awaiting headers)'
metadata:
  creationTimestamp: "2022-04-14T11:18:57Z"
  name: rancher-monitoring-grafana-57777cc795-2b2x5.16e5548dd2523a13
  namespace: cattle-monitoring-system
reason: Unhealthy
reportingComponent: ""
reportingInstance: ""
source:
  component: kubelet
  host: master-1
type: Warning
---
apiVersion: v1
count: 20958
eventTime: null
firstTimestamp: "2022-04-11T10:34:51Z"
involvedObject:
  apiVersion: v1
  fieldPath: spec.containers{lb-port-1883}
  kind: Pod
  name: svclb-emqx-dt22t
  namespace: emqx
kind: Event
lastTimestamp: "2022-04-14T11:39:48Z"
message: Back-off restarting failed container
metadata:
  creationTimestamp: "2022-04-11T10:34:51Z"
  name: svclb-emqx-dt22t.16e4d11e2b9efd27
  namespace: emqx
reason: BackOff
reportingComponent: ""
reportingInstance: ""
source:
  component: kubelet
  host: worker-1
type: Warning
---
apiVersion: v1
count: 21069
eventTime: null
firstTimestamp: "2022-04-11T10:34:48Z"
involvedObject:
  apiVersion: v1
  fieldPath: spec.containers{lb-port-80}
  kind: Pod
  name: svclb-traefik-r5p8t
  namespace: kube-system
kind: Event
lastTimestamp: "2022-04-14T11:44:59Z"
message: Back-off restarting failed container
metadata:
  creationTimestamp: "2022-04-11T10:34:48Z"
  name: svclb-traefik-r5p8t.16e4d11daf0b79ce
  namespace: kube-system
reason: BackOff
reportingComponent: ""
reportingInstance: ""
source:
  component: kubelet
  host: worker-1
type: Warning

{
  "metadata": {
    "name": "event-exporter-79544df9f7-xj4t5.16e5c540dc32614f",
    "namespace": "monitoring",
    "uid": "baf2f642-2383-4e22-87e0-456b6c3eaf4e",
    "resourceVersion": "14043444",
    "creationTimestamp": "2022-04-14T13:08:40Z"
  },
  "reason": "Pulled",
  "message": "Container image \"ghcr.io/opsgenie/kubernetes-event-exporter:v0.11\" already present on machine",
  "source": {
    "component": "kubelet",
    "host": "worker-2"
  },
  "firstTimestamp": "2022-04-14T13:08:40Z",
  "lastTimestamp": "2022-04-14T13:08:40Z",
  "count": 1,
  "type": "Normal",
  "eventTime": null,
  "reportingComponent": "",
  "reportingInstance": "",
  "involvedObject": {
    "kind": "Pod",
    "namespace": "monitoring",
    "name": "event-exporter-79544df9f7-xj4t5",
    "uid": "b77d3e13-fa9e-484b-8a5a-d1afc9edec75",
    "apiVersion": "v1",
    "resourceVersion": "14043435",
    "fieldPath": "spec.containers{event-exporter}",
    "labels": {
      "app": "event-exporter",
      "pod-template-hash": "79544df9f7",
      "version": "v1"
    }
  }
}

We can add more of these fields to the alert message, including:

  • Node: {{ .Source.Host }}
  • Pod: {{ .InvolvedObject.Name }}
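To make the mapping between these two template fields and the raw event concrete, here is a minimal Python sketch. It is only an illustration, not part of event-exporter: the helper name resource_fields is hypothetical, and the event is a trimmed copy of the "Pulled" event shown earlier.

```python
import json

# Trimmed copy of the "Pulled" event shown earlier (only the fields used here).
EVENT_JSON = """
{
  "reason": "Pulled",
  "type": "Normal",
  "source": {"component": "kubelet", "host": "worker-2"},
  "involvedObject": {
    "kind": "Pod",
    "namespace": "monitoring",
    "name": "event-exporter-79544df9f7-xj4t5"
  }
}
"""


def resource_fields(event: dict) -> dict:
    """Hypothetical helper: pick out the values that .Source.Host and
    .InvolvedObject.Name refer to in the alert template."""
    source = event.get("source") or {}
    involved = event.get("involvedObject") or {}
    return {
        "node": source.get("host", ""),    # {{ .Source.Host }}
        "kind": involved.get("kind", ""),  # {{ .InvolvedObject.Kind }}
        "name": involved.get("name", ""),  # {{ .InvolvedObject.Name }}
    }


fields = resource_fields(json.loads(EVENT_JSON))
print(fields)
```

For a Pod event the name is the pod name and the node is the host of the reporting kubelet, which is exactly the information the alert was missing.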

Putting this together, the modified event-exporter-cfg YAML is as follows:

apiVersion: v1
kind: ConfigMap
metadata:
  name: event-exporter-cfg
  namespace: monitoring
  resourceVersion: '5779968'
data:
  config.yaml: |
    logLevel: error
    logFormat: json
    route:
      routes:
        - match:
            - receiver: "dump"
        - drop:
            - type: "Normal"
          match:
            - receiver: "feishu"
    receivers:
      - name: "dump"
        stdout: {}
      - name: "feishu"
        webhook:
          endpoint: "https://open.feishu.cn/open-apis/bot/v2/hook/..."
          headers:
            Content-Type: application/json
          layout:
            msg_type: interactive
            card:
              config:
                wide_screen_mode: true
                enable_forward: true
              header:
                title:
                  tag: plain_text
                  content: xxx测试K3S集群告警
                template: red
              elements:
                - tag: div
                  text:
                    tag: lark_md
                    content: "**EventID:**  {{ .UID }}\n**EventNamespace:**  {{ .InvolvedObject.Namespace }}\n**EventName:**  {{ .InvolvedObject.Name }}\n**EventType:**  {{ .Type }}\n**EventKind:**  {{ .InvolvedObject.Kind }}\n**EventReason:**  {{ .Reason }}\n**EventTime:**  {{ .LastTimestamp }}\n**EventMessage:**  {{ .Message }}\n**EventComponent:**  {{ .Source.Component }}\n**EventHost:**  {{ .Source.Host }}\n**EventLabels:**  {{ toJson .InvolvedObject.Labels}}\n**EventAnnotations:**  {{ toJson .InvolvedObject.Annotations}}"
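As a rough illustration of what the layout section produces, the Python sketch below assembles a card body with the same structure for one of the BackOff events shown earlier. It is an approximation under stated assumptions: build_card and the card title are hypothetical, only a subset of the template fields is rendered, and nothing is actually sent to the webhook.

```python
import json

# Trimmed copy of the svclb-emqx BackOff event shown earlier.
EVENT = {
    "type": "Warning",
    "reason": "BackOff",
    "message": "Back-off restarting failed container",
    "source": {"component": "kubelet", "host": "worker-1"},
    "involvedObject": {"kind": "Pod", "namespace": "emqx", "name": "svclb-emqx-dt22t"},
}


def build_card(event: dict, title: str) -> dict:
    """Hypothetical helper: mirror the webhook layout for a subset of fields."""
    involved = event.get("involvedObject", {})
    source = event.get("source", {})
    lines = [
        f"**EventNamespace:**  {involved.get('namespace', '')}",
        f"**EventName:**  {involved.get('name', '')}",
        f"**EventType:**  {event.get('type', '')}",
        f"**EventKind:**  {involved.get('kind', '')}",
        f"**EventReason:**  {event.get('reason', '')}",
        f"**EventMessage:**  {event.get('message', '')}",
        f"**EventHost:**  {source.get('host', '')}",
    ]
    return {
        "msg_type": "interactive",
        "card": {
            "config": {"wide_screen_mode": True, "enable_forward": True},
            "header": {
                "title": {"tag": "plain_text", "content": title},
                "template": "red",
            },
            "elements": [
                {"tag": "div", "text": {"tag": "lark_md", "content": "\n".join(lines)}}
            ],
        },
    }


card = build_card(EVENT, "K3S cluster alert")  # title is a placeholder
print(json.dumps(card, ensure_ascii=False, indent=2))
```

Printing the body like this is a convenient way to check that the node and pod names land where expected before touching the real ConfigMap.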

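Muting specific nodes and workloads

For example, node worker-1 in cluster 001 undergoes planned maintenance: alerting for it is suspended during the maintenance window and resumes once the maintenance is finished.

Continuing from the previous step, modify the event-exporter-cfg YAML as follows: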

apiVersion: v1
kind: ConfigMap
metadata:
  name: event-exporter-cfg
  namespace: monitoring
data:
  config.yaml: |
    logLevel: error
    logFormat: json
    route:
      routes:
        - match:
            - receiver: "dump"
        - drop:
            - type: "Normal"
            - source:
                host: "worker-1"
            - namespace: "cattle-monitoring-system"
            - name: "*emqx*"
            - kind: "Pod|Deployment|ReplicaSet"
            - labels:
                version: "dev"
          match:
            - receiver: "feishu"
    receivers:
      - name: "dump"
        stdout: {}
      - name: "feishu"
        webhook:
          endpoint: "https://open.feishu.cn/open-apis/bot/v2/hook/..."
          headers:
            Content-Type: application/json
          layout:
            msg_type: interactive
            card:
              config:
                wide_screen_mode: true
                enable_forward: true
              header:
                title:
                  tag: plain_text
                  content: xxx测试K3S集群告警
                template: red
              elements:
                - tag: div
                  text:
                    tag: lark_md
                    content: "**EventID:**  {{ .UID }}\n**EventNamespace:**  {{ .InvolvedObject.Namespace }}\n**EventName:**  {{ .InvolvedObject.Name }}\n**EventType:**  {{ .Type }}\n**EventKind:**  {{ .InvolvedObject.Kind }}\n**EventReason:**  {{ .Reason }}\n**EventTime:**  {{ .LastTimestamp }}\n**EventMessage:**  {{ .Message }}\n**EventComponent:**  {{ .Source.Component }}\n**EventHost:**  {{ .Source.Host }}\n**EventLabels:**  {{ toJson .InvolvedObject.Labels}}\n**EventAnnotations:**  {{ toJson .InvolvedObject.Annotations}}"

The default drop rule is - type: "Normal", which means Normal-level events do not trigger alerts.

Now add the following rules:

            - source:
                host: "worker-1"
            - namespace: "cattle-monitoring-system"
            - name: "*emqx*"
            - kind: "Pod|Deployment|ReplicaSet"
            - labels:
                version: "dev"
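  • ... host: "worker-1": do not alert on node worker-1;
  • ... namespace: "cattle-monitoring-system": do not alert on the namespace cattle-monitoring-system;
  • ... name: "*emqx*": do not alert on resources whose name (usually the pod name) contains emqx;
  • kind: "Pod|Deployment|ReplicaSet": do not alert on Pods, Deployments, or ReplicaSets (that is, ignore alerts related to applications and components);
  • ... version: "dev": do not alert on resources carrying the label version: "dev" (this can be used to mute alerts for specific applications).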
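The intent of the rules above can be modelled with a short Python sketch: an event is muted as soon as any one drop rule matches it. This is a simplified model of the article's description, not event-exporter's actual matching code. The names should_drop, rule_matches and event_view are hypothetical, the source/host rule is flattened to a plain host key, and patterns are applied with a regular-expression search, so "name contains emqx" is written here simply as emqx.

```python
import re

# Simplified stand-ins for the drop rules discussed above.
DROP_RULES = [
    {"type": "Normal"},
    {"host": "worker-1"},
    {"namespace": "cattle-monitoring-system"},
    {"name": "emqx"},
    {"kind": "Pod|Deployment|ReplicaSet"},
    {"labels": {"version": "dev"}},
]


def event_view(event: dict) -> dict:
    """Flatten the event fields that the rules refer to."""
    involved = event.get("involvedObject", {})
    return {
        "type": event.get("type", ""),
        "host": event.get("source", {}).get("host", ""),
        "namespace": involved.get("namespace", ""),
        "name": involved.get("name", ""),
        "kind": involved.get("kind", ""),
        "labels": involved.get("labels") or {},
    }


def rule_matches(rule: dict, view: dict) -> bool:
    """A rule matches only if every field it specifies matches the event."""
    for field, pattern in rule.items():
        if field == "labels":
            for key, value_pattern in pattern.items():
                value = view["labels"].get(key)
                if value is None or not re.search(value_pattern, value):
                    return False
        elif not re.search(pattern, view[field]):
            return False
    return True


def should_drop(event: dict, rules=DROP_RULES) -> bool:
    """An event is muted if any single drop rule matches it."""
    view = event_view(event)
    return any(rule_matches(rule, view) for rule in rules)


node_warning = {
    "type": "Warning",
    "source": {"host": "worker-2"},
    "involvedObject": {"kind": "Node", "name": "worker-2"},
}
maintenance_warning = {
    "type": "Warning",
    "source": {"host": "worker-1"},
    "involvedObject": {"kind": "Node", "name": "worker-1"},
}
print(should_drop(node_warning), should_drop(maintenance_warning))
```

In this model, ending the maintenance window for worker-1 amounts to removing the host entry from the rule list, which corresponds to editing the drop section of the ConfigMap.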

Final result

As shown in the figure below:

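Among any three people walking together, one can always be my teacher; knowledge shared belongs to everyone. This article was written by the 东风微鸣 technical blog, EWhisper.cn.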