使用Prometheus+Alertmanager告警JVM异常情况

jiezi

6 年前

原文地址
在前一篇文章中提到了如何使用 Prometheus+Grafana 来监控 JVM。本文介绍如何使用 Prometheus+Alertmanager 来对 JVM 的某些情况作出告警。
本文所提到的脚本可以在这里下载。
摘要
用到的工具：

Docker，本文大量使用了 Docker 来启动各个应用。

Prometheus，负责抓取 / 存储指标信息，并提供查询功能，本文重点使用它的告警功能。

Grafana，负责数据可视化（本文重点不在于此，只是为了让读者能够直观地看到异常指标）。
Alertmanager，负责将告警通知给相关人员。

JMX exporter，提供 JMX 中和 JVM 相关的 metrics。
Tomcat，用来模拟一个 Java 应用。

先讲一下大致步骤：

利用 JMX exporter，在 Java 进程内启动一个小型的 Http server
配置 Prometheus 抓取那个 Http server 提供的 metrics。

配置 Prometheus 的告警触发规则

heap 使用超过最大上限的 50%、80%、90%
instance down 机时间超过 30 秒、1 分钟、5 分钟
old gc 时间在最近 5 分钟里超过 50%、80%

配置 Grafana 连接 Prometheus，配置 Dashboard。
配置 Alertmanager 的告警通知规则

告警的大致过程如下：

Prometheus 根据告警触发规则查看是否触发告警，如果是，就将告警信息发送给 Alertmanager。

Alertmanager 收到告警信息后，决定是否发送通知，如果是，则决定发送给谁。

第一步：启动几个 Java 应用
1) 新建一个目录，名字叫做 prom-jvm-demo。
2) 下载 JMX exporter 到这个目录。
3) 新建一个文件 simple-config.yml 内容如下：
—
blacklistObjectNames: [“*:*”]
4) 运行以下命令启动 3 个 Tomcat，记得把 <path-to-prom-jvm-demo> 替换成正确的路径（这里故意把 -Xmx 和 -Xms 设置的很小，以触发告警条件）：
docker run -d \
–name tomcat-1 \
-v <path-to-prom-jvm-demo>:/jmx-exporter \
-e CATALINA_OPTS=”-Xms32m -Xmx32m -javaagent:/jmx-exporter/jmx_prometheus_javaagent-0.3.1.jar=6060:/jmx-exporter/simple-config.yml” \
-p 6060:6060 \
-p 8080:8080 \
tomcat:8.5-alpine

docker run -d \
–name tomcat-2 \
-v <path-to-prom-jvm-demo>:/jmx-exporter \
-e CATALINA_OPTS=”-Xms32m -Xmx32m -javaagent:/jmx-exporter/jmx_prometheus_javaagent-0.3.1.jar=6060:/jmx-exporter/simple-config.yml” \
-p 6061:6060 \
-p 8081:8080 \
tomcat:8.5-alpine

docker run -d \
–name tomcat-3 \
-v <path-to-prom-jvm-demo>:/jmx-exporter \
-e CATALINA_OPTS=”-Xms32m -Xmx32m -javaagent:/jmx-exporter/jmx_prometheus_javaagent-0.3.1.jar=6060:/jmx-exporter/simple-config.yml” \
-p 6062:6060 \
-p 8082:8080 \
tomcat:8.5-alpine
5) 访问 http://localhost:8080|8081|8082 看看 Tomcat 是否启动成功。
6) 访问对应的 http://localhost:6060|6061|6062 看看 JMX exporter 提供的 metrics。
备注：这里提供的 simple-config.yml 仅仅提供了 JVM 的信息，更复杂的配置请参考 JMX exporter 文档。
第二步：启动 Prometheus
1) 在之前新建目录 prom-jvm-demo，新建一个文件 prom-jmx.yml，内容如下：
crape_configs:
– job_name: ‘java’
static_configs:
– targets:
– ‘<host-ip>:6060’
– ‘<host-ip>:6061’
– ‘<host-ip>:6062’

# alertmanager 的地址
alerting:
alertmanagers:
– static_configs:
– targets:
– ‘<host-ip>:9093’

# 读取告警触发条件规则
rule_files:
– ‘/prometheus-config/prom-alert-rules.yml’
2) 新建文件 prom-alert-rules.yml，该文件是告警触发规则：
# severity 按严重程度由高到低：red、orange、yello、blue
groups:
– name: jvm-alerting
rules:

# down 了超过 30 秒
– alert: instance-down
expr: up == 0
for: 30s
labels:
severity: yellow
annotations:
summary: “Instance {{$labels.instance}} down”
description: “{{$labels.instance}} of job {{$labels.job}} has been down for more than 30 seconds.”

# down 了超过 1 分钟
– alert: instance-down
expr: up == 0
for: 1m
labels:
severity: orange
annotations:
summary: “Instance {{$labels.instance}} down”
description: “{{$labels.instance}} of job {{$labels.job}} has been down for more than 1 minutes.”

# down 了超过 5 分钟
– alert: instance-down
expr: up == 0
for: 5m
labels:
severity: blue
annotations:
summary: “Instance {{$labels.instance}} down”
description: “{{$labels.instance}} of job {{$labels.job}} has been down for more than 5 minutes.”

# 堆空间使用超过 50%
– alert: heap-usage-too-much
expr: jvm_memory_bytes_used{job=”java”, area=”heap”} / jvm_memory_bytes_max * 100 > 50
for: 1m
labels:
severity: yellow
annotations:
summary: “JVM Instance {{$labels.instance}} memory usage > 50%”
description: “{{$labels.instance}} of job {{$labels.job}} has been in status [heap usage > 50%] for more than 1 minutes. current usage ({{$value}}%)”

# 堆空间使用超过 80%
– alert: heap-usage-too-much
expr: jvm_memory_bytes_used{job=”java”, area=”heap”} / jvm_memory_bytes_max * 100 > 80
for: 1m
labels:
severity: orange
annotations:
summary: “JVM Instance {{$labels.instance}} memory usage > 80%”
description: “{{$labels.instance}} of job {{$labels.job}} has been in status [heap usage > 80%] for more than 1 minutes. current usage ({{$value}}%)”

# 堆空间使用超过 90%
– alert: heap-usage-too-much
expr: jvm_memory_bytes_used{job=”java”, area=”heap”} / jvm_memory_bytes_max * 100 > 90
for: 1m
labels:
severity: red
annotations:
summary: “JVM Instance {{$labels.instance}} memory usage > 90%”
description: “{{$labels.instance}} of job {{$labels.job}} has been in status [heap usage > 90%] for more than 1 minutes. current usage ({{$value}}%)”

# 在 5 分钟里，Old GC 花费时间超过 30%
– alert: old-gc-time-too-much
expr: increase(jvm_gc_collection_seconds_sum{gc=”PS MarkSweep”}[5m]) > 5 * 60 * 0.3
for: 5m
labels:
severity: yellow
annotations:
summary: “JVM Instance {{$labels.instance}} Old GC time > 30% running time”
description: “{{$labels.instance}} of job {{$labels.job}} has been in status [Old GC time > 30% running time] for more than 5 minutes. current seconds ({{$value}}%)”

# 在 5 分钟里，Old GC 花费时间超过 50%
– alert: old-gc-time-too-much
expr: increase(jvm_gc_collection_seconds_sum{gc=”PS MarkSweep”}[5m]) > 5 * 60 * 0.5
for: 5m
labels:
severity: orange
annotations:
summary: “JVM Instance {{$labels.instance}} Old GC time > 50% running time”
description: “{{$labels.instance}} of job {{$labels.job}} has been in status [Old GC time > 50% running time] for more than 5 minutes. current seconds ({{$value}}%)”

# 在 5 分钟里，Old GC 花费时间超过 80%
– alert: old-gc-time-too-much
expr: increase(jvm_gc_collection_seconds_sum{gc=”PS MarkSweep”}[5m]) > 5 * 60 * 0.8
for: 5m
labels:
severity: red
annotations:
summary: “JVM Instance {{$labels.instance}} Old GC time > 80% running time”
description: “{{$labels.instance}} of job {{$labels.job}} has been in status [Old GC time > 80% running time] for more than 5 minutes. current seconds ({{$value}}%)”
3) 启动 Prometheus：
docker run -d \
–name=prometheus \
-p 9090:9090 \
-v <path-to-prom-jvm-demo>:/prometheus-config \
prom/prometheus –config.file=/prometheus-config/prom-jmx.yml
4) 访问 http://localhost:9090/alerts 应该能看到之前配置的告警规则：

如果没有看到三个 instance，那么等一会儿再试。
第三步：配置 Grafana
参考使用 Prometheus+Grafana 监控 JVM
第四步：启动 Alertmanager
1) 新建一个文件 alertmanager-config.yml：
global:
smtp_smarthost: ‘<smtp.host:ip>’
smtp_from: ‘<from>’
smtp_auth_username: ‘<username>’
smtp_auth_password: ‘<password>’

# The directory from which notification templates are read.
templates:
– ‘/alertmanager-config/*.tmpl’

# The root route on which each incoming alert enters.
route:
# The labels by which incoming alerts are grouped together. For example,
# multiple alerts coming in for cluster=A and alertname=LatencyHigh would
# be batched into a single group.
group_by: [‘alertname’, ‘instance’]

# When a new group of alerts is created by an incoming alert, wait at
# least ‘group_wait’ to send the initial notification.
# This way ensures that you get multiple alerts for the same group that start
# firing shortly after another are batched together on the first
# notification.
group_wait: 30s

# When the first notification was sent, wait ‘group_interval’ to send a batch
# of new alerts that started firing for that group.
group_interval: 5m

# If an alert has successfully been sent, wait ‘repeat_interval’ to
# resend them.
repeat_interval: 3h

# A default receiver
receiver: “user-a”

# Inhibition rules allow to mute a set of alerts given that another alert is
# firing.
# We use this to mute any warning-level notifications if the same alert is
# already critical.
inhibit_rules:
– source_match:
severity: ‘red’
target_match_re:
severity: ^(blue|yellow|orange)$
# Apply inhibition if the alertname and instance is the same.
equal: [‘alertname’, ‘instance’]
– source_match:
severity: ‘orange’
target_match_re:
severity: ^(blue|yellow)$
# Apply inhibition if the alertname and instance is the same.
equal: [‘alertname’, ‘instance’]
– source_match:
severity: ‘yellow’
target_match_re:
severity: ^(blue)$
# Apply inhibition if the alertname and instance is the same.
equal: [‘alertname’, ‘instance’]

receivers:
– name: ‘user-a’
email_configs:
– to: ‘<user-a@domain.com>’
修改里面关于 smtp_* 的部分和最下面 user- a 的邮箱地址。
备注：因为国内邮箱几乎都不支持 TLS，而 Alertmanager 目前又不支持 SSL，因此请使用 Gmail 或其他支持 TLS 的邮箱来发送告警邮件，见这个 issue
2) 新建文件 alert-template.tmpl，这个是邮件内容模板：
{{define “email.default.html”}}
<h2>Summary</h2>

<p>{{.CommonAnnotations.summary}}</p>

<h2>Description</h2>

<p>{{.CommonAnnotations.description}}</p>
{{end}}
3）运行下列命令启动：
docker run -d \
–name=alertmanager \
-v <path-to-prom-jvm-demo>:/alertmanager-config \
-p 9093:9093 \
prom/alertmanager –config.file=/alertmanager-config/alertmanager-config.yml
4) 访问 http://localhost:9093，看看有没有收到 Prometheus 发送过来的告警 (如果没有看到稍等一下)：

第五步：等待邮件
等待一会儿（最多 5 分钟）看看是否收到邮件。如果没有收到，检查配置是否正确，或者 docker logs alertmanager 看看 alertmanager 的日志，一般来说都是邮箱配置错误导致。