IoT Edge Clusters: Implementing Alert Notifications Based on Kubernetes Events
Goals

- Recovery notifications for alerts – evaluated and found infeasible.
  - Reason: an alert and its recovery are two entirely separate, unrelated events. Alerts are Warning level, while recoveries are Normal level. To get recovery notifications you would have to send all Normal events, and their volume is frightening; moreover, unless you are unusually experienced and patient, there is no way to tell which Normal event corresponds to which alert's recovery.
- Keep alerting until the issue is resolved – available out of the box; no extra configuration needed.
- Alert content shows resource names, such as node and pod names.
- Specific nodes and workloads can be muted, and the muting rules can be adjusted dynamically.
  - For example: node worker-1 in cluster 001 undergoes planned maintenance; monitoring is stopped for the duration and resumed once the maintenance is complete.

Configuration

Showing Resource Names in Alert Content
Here are a few typical kinds of events:
```yaml
apiVersion: v1
count: 101557
eventTime: null
firstTimestamp: "2022-04-08T03:50:47Z"
involvedObject:
  apiVersion: v1
  fieldPath: spec.containers{prometheus}
  kind: Pod
  name: prometheus-rancher-monitoring-prometheus-0
  namespace: cattle-monitoring-system
kind: Event
lastTimestamp: "2022-04-14T11:39:19Z"
message: 'Readiness probe failed: Get "http://10.42.0.87:9090/-/ready": context deadline
  exceeded (Client.Timeout exceeded while awaiting headers)'
metadata:
  creationTimestamp: "2022-04-08T03:51:17Z"
  name: prometheus-rancher-monitoring-prometheus-0.16e3cf53f0793344
  namespace: cattle-monitoring-system
reason: Unhealthy
reportingComponent: ""
reportingInstance: ""
source:
  component: kubelet
  host: master-1
type: Warning
---
apiVersion: v1
count: 116
eventTime: null
firstTimestamp: "2022-04-13T02:43:26Z"
involvedObject:
  apiVersion: v1
  fieldPath: spec.containers{grafana}
  kind: Pod
  name: rancher-monitoring-grafana-57777cc795-2b2x5
  namespace: cattle-monitoring-system
kind: Event
lastTimestamp: "2022-04-14T11:18:56Z"
message: 'Readiness probe failed: Get "http://10.42.0.90:3000/api/health": context
  deadline exceeded (Client.Timeout exceeded while awaiting headers)'
metadata:
  creationTimestamp: "2022-04-14T11:18:57Z"
  name: rancher-monitoring-grafana-57777cc795-2b2x5.16e5548dd2523a13
  namespace: cattle-monitoring-system
reason: Unhealthy
reportingComponent: ""
reportingInstance: ""
source:
  component: kubelet
  host: master-1
type: Warning
---
apiVersion: v1
count: 20958
eventTime: null
firstTimestamp: "2022-04-11T10:34:51Z"
involvedObject:
  apiVersion: v1
  fieldPath: spec.containers{lb-port-1883}
  kind: Pod
  name: svclb-emqx-dt22t
  namespace: emqx
kind: Event
lastTimestamp: "2022-04-14T11:39:48Z"
message: Back-off restarting failed container
metadata:
  creationTimestamp: "2022-04-11T10:34:51Z"
  name: svclb-emqx-dt22t.16e4d11e2b9efd27
  namespace: emqx
reason: BackOff
reportingComponent: ""
reportingInstance: ""
source:
  component: kubelet
  host: worker-1
type: Warning
---
apiVersion: v1
count: 21069
eventTime: null
firstTimestamp: "2022-04-11T10:34:48Z"
involvedObject:
  apiVersion: v1
  fieldPath: spec.containers{lb-port-80}
  kind: Pod
  name: svclb-traefik-r5p8t
  namespace: kube-system
kind: Event
lastTimestamp: "2022-04-14T11:44:59Z"
message: Back-off restarting failed container
metadata:
  creationTimestamp: "2022-04-11T10:34:48Z"
  name: svclb-traefik-r5p8t.16e4d11daf0b79ce
  namespace: kube-system
reason: BackOff
reportingComponent: ""
reportingInstance: ""
source:
  component: kubelet
  host: worker-1
type: Warning
```
```json
{
  "metadata": {
    "name": "event-exporter-79544df9f7-xj4t5.16e5c540dc32614f",
    "namespace": "monitoring",
    "uid": "baf2f642-2383-4e22-87e0-456b6c3eaf4e",
    "resourceVersion": "14043444",
    "creationTimestamp": "2022-04-14T13:08:40Z"
  },
  "reason": "Pulled",
  "message": "Container image \"ghcr.io/opsgenie/kubernetes-event-exporter:v0.11\" already present on machine",
  "source": {
    "component": "kubelet",
    "host": "worker-2"
  },
  "firstTimestamp": "2022-04-14T13:08:40Z",
  "lastTimestamp": "2022-04-14T13:08:40Z",
  "count": 1,
  "type": "Normal",
  "eventTime": null,
  "reportingComponent": "",
  "reportingInstance": "",
  "involvedObject": {
    "kind": "Pod",
    "namespace": "monitoring",
    "name": "event-exporter-79544df9f7-xj4t5",
    "uid": "b77d3e13-fa9e-484b-8a5a-d1afc9edec75",
    "apiVersion": "v1",
    "resourceVersion": "14043435",
    "fieldPath": "spec.containers{event-exporter}",
    "labels": {
      "app": "event-exporter",
      "pod-template-hash": "79544df9f7",
      "version": "v1"
    }
  }
}
```
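To see which fields matter for alerting, here is a minimal Python sketch (the `summarize_event` helper is my own illustration, not part of event-exporter) that pulls the fields we will surface — type, reason, resource name, source host — out of an event represented as a dict:

```python
def summarize_event(event: dict) -> dict:
    """Extract the fields worth surfacing in an alert notification."""
    obj = event.get("involvedObject", {})
    src = event.get("source", {})
    return {
        "type": event.get("type"),        # Warning / Normal
        "reason": event.get("reason"),    # e.g. Unhealthy, BackOff
        "kind": obj.get("kind"),          # e.g. Pod
        "name": obj.get("name"),          # the resource (pod) name
        "namespace": obj.get("namespace"),
        "host": src.get("host"),          # the node the event came from
        "count": event.get("count", 1),   # how many times it has fired
        "message": event.get("message"),
    }

# Example: the BackOff event shown above.
event = {
    "type": "Warning",
    "reason": "BackOff",
    "count": 20958,
    "message": "Back-off restarting failed container",
    "involvedObject": {"kind": "Pod", "name": "svclb-emqx-dt22t", "namespace": "emqx"},
    "source": {"component": "kubelet", "host": "worker-1"},
}
summary = summarize_event(event)
print(summary["host"], summary["name"], summary["reason"])
```

Note that `source.host` and `involvedObject.name` are exactly the two fields the default alert template was missing.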
We can add more of these fields to the alert message, including:

- Node: `{{.Source.Host}}`
- Pod: `{{.InvolvedObject.Name}}`

Putting this together, the modified event-exporter-cfg YAML looks like this:
```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: event-exporter-cfg
  namespace: monitoring
  resourceVersion: '5779968'
data:
  config.yaml: |
    logLevel: error
    logFormat: json
    route:
      routes:
        - match:
            - receiver: "dump"
        - drop:
            - type: "Normal"
          match:
            - receiver: "feishu"
    receivers:
      - name: "dump"
        stdout: {}
      - name: "feishu"
        webhook:
          endpoint: "https://open.feishu.cn/open-apis/bot/v2/hook/..."
          headers:
            Content-Type: application/json
          layout:
            msg_type: interactive
            card:
              config:
                wide_screen_mode: true
                enable_forward: true
              header:
                title:
                  tag: plain_text
                  content: xxx Test K3S Cluster Alerts
                template: red
              elements:
                - tag: div
                  text:
                    tag: lark_md
                    content: "**EventID:** {{.UID}}\n**EventNamespace:** {{.InvolvedObject.Namespace}}\n**EventName:** {{.InvolvedObject.Name}}\n**EventType:** {{.Type}}\n**EventKind:** {{.InvolvedObject.Kind}}\n**EventReason:** {{.Reason}}\n**EventTime:** {{.LastTimestamp}}\n**EventMessage:** {{.Message}}\n**EventComponent:** {{.Source.Component}}\n**EventHost:** {{.Source.Host}}\n**EventLabels:** {{toJson .InvolvedObject.Labels}}\n**EventAnnotations:** {{toJson .InvolvedObject.Annotations}}"
```
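For reference, the `layout` above is what the webhook receiver ultimately POSTs to the Feishu bot endpoint as JSON. The sketch below rebuilds the same interactive-card payload in Python so you can eyeball or test it; `build_feishu_card` is a hypothetical helper of mine, but the `msg_type`/`card` structure follows Feishu's bot message format as used in the config:

```python
import json

def build_feishu_card(title: str, fields: dict) -> dict:
    """Build a Feishu interactive-card payload mirroring the layout above."""
    # The lark_md body is one "**Key:** value" line per field,
    # just like the rendered template string in the ConfigMap.
    content = "\n".join(f"**{k}:** {v}" for k, v in fields.items())
    return {
        "msg_type": "interactive",
        "card": {
            "config": {"wide_screen_mode": True, "enable_forward": True},
            "header": {
                "title": {"tag": "plain_text", "content": title},
                "template": "red",  # red header for alert severity
            },
            "elements": [
                {"tag": "div", "text": {"tag": "lark_md", "content": content}},
            ],
        },
    }

payload = build_feishu_card(
    "xxx Test K3S Cluster Alerts",
    {"EventReason": "BackOff", "EventHost": "worker-1"},
)
# Sending it would be a plain HTTP POST to the bot webhook URL, e.g.:
#   requests.post(endpoint, json=payload,
#                 headers={"Content-Type": "application/json"})
print(json.dumps(payload, ensure_ascii=False)[:60])
```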
Muting Specific Nodes and Workloads

For example: node worker-1 in cluster 001 undergoes planned maintenance; monitoring is stopped for the duration and resumed once the maintenance is complete.

Continue editing the event-exporter-cfg YAML as follows:
```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: event-exporter-cfg
  namespace: monitoring
data:
  config.yaml: |
    logLevel: error
    logFormat: json
    route:
      routes:
        - match:
            - receiver: "dump"
        - drop:
            - type: "Normal"
            - source:
                host: "worker-1"
            - namespace: "cattle-monitoring-system"
            - name: "*emqx*"
            - kind: "Pod|Deployment|ReplicaSet"
            - labels:
                version: "dev"
          match:
            - receiver: "feishu"
    receivers:
      - name: "dump"
        stdout: {}
      - name: "feishu"
        webhook:
          endpoint: "https://open.feishu.cn/open-apis/bot/v2/hook/..."
          headers:
            Content-Type: application/json
          layout:
            msg_type: interactive
            card:
              config:
                wide_screen_mode: true
                enable_forward: true
              header:
                title:
                  tag: plain_text
                  content: xxx Test K3S Cluster Alerts
                template: red
              elements:
                - tag: div
                  text:
                    tag: lark_md
                    content: "**EventID:** {{.UID}}\n**EventNamespace:** {{.InvolvedObject.Namespace}}\n**EventName:** {{.InvolvedObject.Name}}\n**EventType:** {{.Type}}\n**EventKind:** {{.InvolvedObject.Kind}}\n**EventReason:** {{.Reason}}\n**EventTime:** {{.LastTimestamp}}\n**EventMessage:** {{.Message}}\n**EventComponent:** {{.Source.Component}}\n**EventHost:** {{.Source.Host}}\n**EventLabels:** {{toJson .InvolvedObject.Labels}}\n**EventAnnotations:** {{toJson .InvolvedObject.Annotations}}"
```
The default drop rule is `- type: "Normal"`, i.e. do not alert on Normal-level events. We now add the following rules:
```yaml
- source:
    host: "worker-1"
- namespace: "cattle-monitoring-system"
- name: "*emqx*"
- kind: "Pod|Deployment|ReplicaSet"
- labels:
    version: "dev"
```
- `host: "worker-1"`: do not alert on events from node worker-1;
- `namespace: "cattle-monitoring-system"`: do not alert on events in the cattle-monitoring-system namespace;
- `name: "*emqx*"`: do not alert on objects whose name (usually the pod name) contains emqx;
- `kind: "Pod|Deployment|ReplicaSet"`: do not alert on Pod, Deployment, or ReplicaSet events (i.e. ignore application- and component-level alerts);
- `labels: version: "dev"`: do not alert on objects carrying the label version: "dev" (this can be used to mute alerts for a specific application).
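As a sanity check on the semantics, here is a small Python approximation of how such drop rules can be evaluated. This is my own sketch, not event-exporter's actual rule engine: I assume a name pattern like `*emqx*` behaves as a glob, a kind pattern like `Pod|Deployment|ReplicaSet` as regex alternation, and a labels rule as a subset match; an event is dropped if any rule matches it in full.

```python
import fnmatch
import re

def matches_rule(event: dict, rule: dict) -> bool:
    """True if the event satisfies every condition in one drop rule."""
    obj = event.get("involvedObject", {})
    if "type" in rule and event.get("type") != rule["type"]:
        return False
    if "source" in rule and \
            event.get("source", {}).get("host") != rule["source"].get("host"):
        return False
    if "namespace" in rule and obj.get("namespace") != rule["namespace"]:
        return False
    if "name" in rule and not fnmatch.fnmatch(obj.get("name", ""), rule["name"]):
        return False
    if "kind" in rule and not re.fullmatch(rule["kind"], obj.get("kind", "")):
        return False
    if "labels" in rule:
        labels = obj.get("labels", {})
        if any(labels.get(k) != v for k, v in rule["labels"].items()):
            return False
    return True

# The drop rules from the ConfigMap above.
DROP_RULES = [
    {"type": "Normal"},
    {"source": {"host": "worker-1"}},
    {"namespace": "cattle-monitoring-system"},
    {"name": "*emqx*"},
    {"kind": "Pod|Deployment|ReplicaSet"},
    {"labels": {"version": "dev"}},
]

def should_drop(event: dict) -> bool:
    # An event is dropped as soon as any single rule matches.
    return any(matches_rule(event, r) for r in DROP_RULES)
```

For instance, a Warning event from node worker-1 is dropped by the second rule even though its type is not Normal, while a Warning event about a Node on worker-2 with no matching name, namespace, or labels still gets through.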
Final Result

As shown in the screenshot below:
🎉🎉🎉
Among any three people walking, there is always one I can learn from; knowledge shared is for the good of all. This post was written by the tech blog EWhisper.cn (东风微鸣).