Prometheus: A Brief Look at Monitoring and Alerting

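Preface

A recent project of ours needed a complete monitoring and alerting system, and we chose the open-source system Prometheus. It is powerful, easy to extend, and simple to install and use. This article first walks through the overall Prometheus monitoring flow, then covers how to collect monitoring data, how to display it, and how to trigger alerts, and finally shows a demo of monitoring a business system.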

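Monitoring architecture

The overall architecture and workflow of Prometheus are shown in the following figure:

The workflow breaks down roughly into collecting data, storing data, displaying monitoring data, and alerting. The core components are Exporters, Prometheus Server, AlertManager, and PushGateway: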

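Exporters: collectors of monitoring data; they expose the data to Prometheus Server over HTTP.
Prometheus Server: responsible for fetching, storing, and querying monitoring data. The data it fetches must be in the specified metrics format, otherwise it cannot be processed. For querying, Prometheus provides PromQL, which makes it easy to query and aggregate data, and Prometheus itself also ships with a web UI.
AlertManager: Prometheus supports alerting rules written in PromQL. When a rule is satisfied an alert is created, and the rest of the alerting flow is handed over to AlertManager, which supports several notification methods, including email and webhook.
PushGateway: normally Prometheus Server can talk to an exporter directly and pull data from it. When the network does not allow that, PushGateway can be used as a relay.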

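Collecting data

The main job of an exporter is to collect data and expose it over HTTP; Prometheus then obtains the monitoring data by pulling it periodically.
The data comes from many sources: system-level data such as a node's CPU and I/O, middleware such as MySQL and message queues, process-level data such as the JVM, and business monitoring data. Business metrics differ from system to system, but the rest is much the same everywhere, so exporters fall into two groups by origin: community-provided and user-defined.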

Exporter sources

Community-provided

Category	Common exporters
Databases	MySQL Exporter, Redis Exporter, MongoDB Exporter, etc.
Hardware	Node Exporter, etc.
Message queues	Kafka Exporter, RabbitMQ Exporter, etc.
HTTP servers	Apache Exporter, Nginx Exporter, etc.
Storage	HDFS Exporter, etc.
API services	Docker Hub Exporter, GitHub Exporter, etc.
Other	JIRA Exporter, Jenkins Exporter, Confluence Exporter, etc.

Third-party exporters listed on the official site: Exporters

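User-defined

Besides the third-party exporters above, users can write their own. A custom exporter is built on one of the client libraries provided by Prometheus, which are available for several languages, including Go, Java/Scala, Python, and Ruby.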

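How exporters run

In terms of how they run, exporters can be either stand-alone or integrated into the application.

Stand-alone

Middleware such as MySQL, Redis, and message queues does not support Prometheus natively. In that case a separate exporter process can read the monitoring data through the API the middleware exposes and convert it into a format Prometheus understands.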
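To make the stand-alone pattern concrete, here is a minimal sketch that uses only the Python standard library. It is an illustration under assumptions, not how any real exporter is implemented: `fetch_queue_depths` is a hypothetical stand-in for a call to the middleware's monitoring API, and the metric name `demo_queue_messages` and port 9200 are made up. A production exporter would normally be built on an official client library instead.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer


def fetch_queue_depths():
    """Hypothetical stand-in for a call to the middleware's monitoring API."""
    return {"orders": 12, "emails": 0}


def render_metrics(depths):
    """Convert raw numbers into the Prometheus text exposition format."""
    lines = [
        "# HELP demo_queue_messages Messages currently in the queue",
        "# TYPE demo_queue_messages gauge",
    ]
    for queue, depth in sorted(depths.items()):
        lines.append('demo_queue_messages{queue="%s"} %s' % (queue, float(depth)))
    return "\n".join(lines) + "\n"


class MetricsHandler(BaseHTTPRequestHandler):
    """Serves the rendered metrics on /metrics and nothing else."""

    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics(fetch_queue_depths()).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port=9200):
    """Block forever, answering scrapes on the given (assumed) port."""
    HTTPServer(("0.0.0.0", port), MetricsHandler).serve_forever()
```

Calling `serve()` starts the process; Prometheus would then be pointed at `host:9200` as a scrape target.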

Integrated into the application

Systems that need custom metrics can use a Prometheus client library to expose monitoring data to Prometheus from inside the application itself.

Data format

Prometheus obtains monitoring data by polling exporters. The data must follow a specific format, or Prometheus cannot recognize it; this is the metrics format:

<metric name>{<label name>=<label value>, ...}

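It has three main parts, and the names must match the corresponding regular expressions:

metric name: the name of the metric, which reflects the meaning of the monitored sample; it must match [a-zA-Z_:][a-zA-Z0-9_:]*
label name: labels reflect the characteristic dimensions of the current sample; a label name must match [a-zA-Z_][a-zA-Z0-9_]*
label value: the value of each label; its format is not restricted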
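The naming rules can be checked mechanically. The following Python sketch applies the two classic patterns from the Prometheus data model documentation; the helper names are made up for illustration.

```python
import re

# Classic naming patterns from the Prometheus data model documentation.
METRIC_NAME_RE = re.compile(r"[a-zA-Z_:][a-zA-Z0-9_:]*")
LABEL_NAME_RE = re.compile(r"[a-zA-Z_][a-zA-Z0-9_]*")


def is_valid_metric_name(name):
    """True if the whole string matches the metric-name pattern."""
    return METRIC_NAME_RE.fullmatch(name) is not None


def is_valid_label_name(name):
    """True if the whole string matches the label-name pattern."""
    return LABEL_NAME_RE.fullmatch(name) is not None
```

For example, `http_requests_total` passes, while `2xx_total` (leading digit) and the label name `my-label` (hyphen) do not.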

Here is some JVM monitoring data:

# HELP jvm_memory_max_bytes The maximum amount of memory in bytes that can be used for memory management
# TYPE jvm_memory_max_bytes gauge
jvm_memory_max_bytes{application="springboot-actuator-prometheus-test",area="nonheap",id="Metaspace",} -1.0
jvm_memory_max_bytes{application="springboot-actuator-prometheus-test",area="heap",id="PS Eden Space",} 1.033895936E9
jvm_memory_max_bytes{application="springboot-actuator-prometheus-test",area="nonheap",id="Code Cache",} 2.5165824E8
jvm_memory_max_bytes{application="springboot-actuator-prometheus-test",area="nonheap",id="Compressed Class Space",} 1.073741824E9
jvm_memory_max_bytes{application="springboot-actuator-prometheus-test",area="heap",id="PS Survivor Space",} 2621440.0
jvm_memory_max_bytes{application="springboot-actuator-prometheus-test",area="heap",id="PS Old Gen",} 2.09190912E9
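To see how a sample line like the ones above decomposes into name, labels, and value, here is a small Python parsing sketch. It is deliberately simplified: it ignores the optional trailing timestamp, does not unescape label values, and `parse_sample` is a made-up helper rather than part of any Prometheus library.

```python
import re

# name, optional {labels}, whitespace, value
SAMPLE_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)$')
# one name="value" pair; tolerates backslash escapes inside the quotes
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_sample(line):
    """Split one exposition line into (metric name, labels dict, float value)."""
    match = SAMPLE_RE.match(line.strip())
    if match is None:
        raise ValueError("not a sample line: %r" % line)
    name, raw_labels, value = match.groups()
    labels = dict(LABEL_RE.findall(raw_labels or ""))
    return name, labels, float(value)


name, labels, value = parse_sample(
    'jvm_memory_max_bytes{application="springboot-actuator-prometheus-test",'
    'area="heap",id="PS Eden Space",} 1.033895936E9'
)
print(name, labels["area"], labels["id"], value)
```

Each distinct combination of metric name and label values identifies one time series, which is why the six lines above are six separate series of the same metric.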

More: data_model

Metric types

Prometheus defines four metric types: Counter, Gauge, Histogram, and Summary.

Counter

A counter only ever increases (it goes back to zero only when the process restarts). It can be used, for example, to record how many times an event has occurred in an application. A common example is http_requests_total.

# HELP jvm_gc_memory_allocated_bytes_total Incremented for an increase in the size of the young generation memory pool after one GC to before the next
# TYPE jvm_gc_memory_allocated_bytes_total counter
jvm_gc_memory_allocated_bytes_total{application="springboot-actuator-prometheus-test",} 6.3123664E9
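Because the raw value of a counter only grows, what is usually of interest is how fast it grows; in PromQL that is what `rate()` and `increase()` are for. The Python sketch below shows the underlying idea, including the handling of a counter reset. It is a simplification: the real functions also extrapolate to the edges of the time window, and the function names here are made up.

```python
def counter_increase(samples):
    """Total increase across (timestamp, value) samples, tolerating resets."""
    total = 0.0
    for (_, prev), (_, curr) in zip(samples, samples[1:]):
        # A drop means the process restarted and the counter began again from 0.
        total += curr if curr < prev else curr - prev
    return total


def per_second_rate(samples):
    """Average per-second rate over the window covered by the samples."""
    elapsed = samples[-1][0] - samples[0][0]
    return counter_increase(samples) / elapsed if elapsed > 0 else 0.0


# Four scrapes 15 seconds apart; the counter resets before the third one.
scrapes = [(0, 100.0), (15, 130.0), (30, 10.0), (45, 40.0)]
print(counter_increase(scrapes))  # 30 + 10 + 30 = 70.0
```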

Gauge

A gauge reflects the current state of the system and can go up as well as down. Common examples are node_memory_MemFree (the amount of memory currently free on the host) and node_memory_MemAvailable (the amount of memory available).

# HELP jvm_threads_live_threads The current number of live threads including both daemon and non-daemon threads
# TYPE jvm_threads_live_threads gauge
jvm_threads_live_threads{application="springboot-actuator-prometheus-test",} 20.0

Histogram and Summary

Both are mainly used to collect statistics on, and analyze, the distribution of samples.

# HELP jvm_gc_pause_seconds Time spent in GC pause
# TYPE jvm_gc_pause_seconds summary
jvm_gc_pause_seconds_count{action="end of minor GC",application="springboot-actuator-prometheus-test",cause="Metadata GC Threshold",} 1.0
jvm_gc_pause_seconds_sum{action="end of minor GC",application="springboot-actuator-prometheus-test",cause="Metadata GC Threshold",} 0.008
jvm_gc_pause_seconds_count{action="end of minor GC",application="springboot-actuator-prometheus-test",cause="Allocation Failure",} 38.0
jvm_gc_pause_seconds_sum{action="end of minor GC",application="springboot-actuator-prometheus-test",cause="Allocation Failure",} 0.134
jvm_gc_pause_seconds_count{action="end of major GC",application="springboot-actuator-prometheus-test",cause="Metadata GC Threshold",} 1.0
jvm_gc_pause_seconds_sum{action="end of major GC",application="springboot-actuator-prometheus-test",cause="Metadata GC Threshold",} 0.073
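The `_sum` and `_count` series are what make averages computable: dividing one by the other gives the mean observation. As a worked example in Python, using the "Allocation Failure" values from the sample above (the helper name is made up):

```python
def average_seconds(sum_seconds, count):
    """Mean duration of the observed events; 0.0 when nothing was observed."""
    return sum_seconds / count if count else 0.0


# Minor GCs caused by "Allocation Failure": 38 pauses totalling 0.134 s.
avg = average_seconds(0.134, 38.0)
print("%.2f ms" % (avg * 1000))  # prints "3.53 ms"
```

In PromQL the same idea is usually written over a time window, for example `rate(jvm_gc_pause_seconds_sum[5m]) / rate(jvm_gc_pause_seconds_count[5m])`.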

More: metric_types

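Displaying data

Prometheus data can be displayed with the built-in Prometheus UI or with Grafana. The Prometheus UI is the web UI that ships with Prometheus and is convenient for running and testing PromQL.
Grafana is an open-source application written in Go that lets you pull data from sources such as Elasticsearch, Prometheus, Graphite, and InfluxDB and visualize it with polished charts.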

Prometheus UI

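The main page looks roughly like this:

All registered exporters can be viewed in the UI, alerts can be viewed on the Alerts page, and PromQL can be run to query and display monitoring data.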

Grafana

In Grafana, each monitoring query can be turned into a panel, and panels can be displayed in many ways, for example:

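Introduction to PromQL

PromQL is the query language built into Prometheus and can be thought of as its counterpart to SQL. It provides rich querying, logical operations, aggregation functions, and more.

Operators

Operators include arithmetic operators, comparison operators, logical operators, and so on; for example: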

rabbitmq_queue_messages>0

Aggregation functions

A large number of built-in functions are provided, such as sum, min (minimum), max (maximum), and avg (average):

sum(rabbitmq_queue_messages)>0
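Such expressions can also be run outside the web UI through the Prometheus HTTP API, whose instant-query endpoint is `/api/v1/query`. The Python sketch below uses only the standard library; the base URL `http://localhost:9090` in the usage note is an assumption, and the helper names are made up.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def build_query_url(base_url, promql):
    """URL of an instant query against the Prometheus HTTP API."""
    return "%s/api/v1/query?%s" % (base_url.rstrip("/"), urlencode({"query": promql}))


def extract_values(response_body):
    """Return (labels, value) pairs from an instant-query JSON response."""
    payload = json.loads(response_body)
    if payload.get("status") != "success":
        raise RuntimeError("query failed: %s" % payload.get("error"))
    # Each result carries its label set and a [timestamp, "value"] pair.
    return [(item["metric"], float(item["value"][1])) for item in payload["data"]["result"]]


def instant_query(base_url, promql):
    """Run the query over HTTP and return the parsed results."""
    with urlopen(build_query_url(base_url, promql), timeout=10) as resp:
        return extract_values(resp.read().decode("utf-8"))
```

With a server running, `instant_query("http://localhost:9090", "sum(rabbitmq_queue_messages)>0")` would return a one-element list when the total is above zero and an empty list otherwise, because a comparison operator filters out series that do not satisfy it.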

More: PromQL

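Alerting

The alerting flow is roughly as follows: alerting rules are configured in Prometheus using PromQL, and when a rule fires, a message is sent to a receiver. That receiver is AlertManager, which can be configured with several notification methods such as email and webhook.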

Custom alerting rules

Alerting rules in Prometheus let you define trigger conditions as PromQL expressions. The Prometheus back end evaluates these rules periodically and raises an alert notification when a condition is met.

For example, the following alerting rule:

groups:
- name: queue-messages-warning
  rules:
  - alert: queue-messages-warning
    expr: sum(rabbitmq_queue_messages{job='rabbit-state-metrics'}) > 500
    labels:
      team: webhook-warning
    annotations:
      summary: High queue-messages usage detected
      threshold: '500'
      current: '{{ $value }}'

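alert: the name of the alerting rule;
expr: the trigger condition, written as a PromQL expression;
labels: custom labels attached to the alert; AlertManager uses them to route the alert to a specific receiver;
annotations: a set of additional information, such as text describing the alert in detail.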

AlertManager

AlertManager is an alert manager that supports many notification methods, including email, PagerDuty, OpsGenie, and webhook. Once the alerting rule expression above evaluates to true, the alert is sent to AlertManager, which then notifies developers through these richer channels.

global:
  resolve_timeout: 5m
route:
  receiver: webhook
  group_wait: 30s
  group_interval: 1m
  repeat_interval: 5m
  group_by:
    - alertname
  routes:
    - receiver: webhook
      group_wait: 10s
      match:
        team: webhook-warning
receivers:
  - name: webhook
    webhook_configs:
      - url: 'http://ip:port/api/v1/monitor/alert-receiver'
        send_resolved: true

The configuration above defines the route and the webhook receiver in AlertManager.
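On the receiving side, the webhook endpoint gets a JSON document whose `alerts` array carries the labels and annotations of each alert. The Python sketch below shows one way such a payload could be turned into readable messages; `summarize_alerts` is a made-up helper, and the sample payload is hand-written with only the fields the sketch reads, using the `summary` and `current` annotations from the rule shown earlier.

```python
import json


def summarize_alerts(body):
    """Turn an Alertmanager webhook payload into one readable line per alert."""
    payload = json.loads(body)
    lines = []
    for alert in payload.get("alerts", []):
        labels = alert.get("labels", {})
        annotations = alert.get("annotations", {})
        lines.append("[%s] %s: %s (current=%s)" % (
            alert.get("status", "unknown"),
            labels.get("alertname", "?"),
            annotations.get("summary", ""),
            annotations.get("current", "n/a"),
        ))
    return lines


# Hand-written example payload, trimmed to the fields used above.
sample = json.dumps({
    "status": "firing",
    "alerts": [{
        "status": "firing",
        "labels": {"alertname": "queue-messages-warning", "team": "webhook-warning"},
        "annotations": {"summary": "High queue-messages usage detected", "current": "812"},
    }],
})
print(summarize_alerts(sample)[0])
```

Because `send_resolved: true` is set, the same endpoint would also receive payloads whose alerts have the status `resolved`, so a receiver should not assume every call is a new problem.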
More: alerting

Installation and configuration

This section covers installing the core components: Prometheus, AlertManager, an exporter, and Grafana. All of them are installed on Kubernetes.

Prometheus and AlertManager

The following YAML file installs Prometheus and AlertManager as two containers in the same Pod:

apiVersion: apps/v1
kind: Deployment
metadata:
  annotations:
    deployment.kubernetes.io/revision: '18'
  generation: 18
  labels:
    app: prometheus
  name: prometheus
  namespace: monitoring
spec:
  progressDeadlineSeconds: 600
  replicas: 1
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      app: prometheus
  template:
    metadata:
      labels:
        app: prometheus
    spec:
      containers:
        - image: 'prom/prometheus:latest'
          imagePullPolicy: Always
          name: prometheus-0
          ports:
            - containerPort: 9090
              name: p-port
              protocol: TCP
          resources:
            requests:
              cpu: 250m
              memory: 512Mi
          terminationMessagePath: /dev/termination-log
          terminationMessagePolicy: File
          volumeMounts:
            - mountPath: /etc/prometheus
              name: config-volume
        - image: 'prom/alertmanager:latest'
          imagePullPolicy: Always
          name: prometheus-1
          ports:
            - containerPort: 9093
              name: a-port
              protocol: TCP
          resources: {}
          terminationMessagePath: /dev/termination-log
          terminationMessagePolicy: File
          volumeMounts:
            - mountPath: /etc/alertmanager
              name: alertcfg
      dnsPolicy: ClusterFirst
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext: {}
      terminationGracePeriodSeconds: 30
      volumes:
        - name: data
          persistentVolumeClaim:
            claimName: monitoring-nfs-pvc
        - configMap:
            defaultMode: 420
            name: prometheus-config
          name: config-volume
        - configMap:
            defaultMode: 420
            name: alert-config
          name: alertcfg

It specifies two images, prom/prometheus:latest and prom/alertmanager:latest, along with the ports they expose. The two containers need the configuration files prometheus.yml and alertmanager.yml at startup, so two ConfigMaps, prometheus-config and alert-config, are declared under volumes.

prometheus.yml is configured as follows:

global:
  scrape_interval:     15s
  evaluation_interval: 15s
rule_files:
  - 'rabbitmq_warn.yml'
alerting:
  alertmanagers:
    - static_configs:
      - targets: ['127.0.0.1:9093']
scrape_configs:
- job_name: 'rabbit-state-metrics'
  static_configs:
    - targets: ['ip:port']

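This configures the AlertManager address (127.0.0.1:9093 works because both containers run in the same Pod), the rule file rabbitmq_warn.yml, and the exporters to scrape, which are the job_name entries; more than one can be configured.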

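Viewing exporters

After Prometheus starts, the exporters and alerting rules can be viewed in the Prometheus web UI:

All current exporters are listed under Status/Targets. If a target's state is UP, Prometheus is able to receive monitoring data from it, such as the RabbitMQ monitoring data configured here.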

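Viewing alerts

The configured alerts can also be viewed in the Prometheus web UI:

When an alerting rule fires it is shown in red, and a message is sent to AlertManager at the same time.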

Grafana

The YAML file for installing Grafana is as follows:

apiVersion: apps/v1
kind: Deployment
metadata:
  annotations:
    deployment.kubernetes.io/revision: '1'
  generation: 1
  labels:
    app: grafana
  name: grafana
  namespace: monitoring
spec:
  progressDeadlineSeconds: 600
  replicas: 1
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      app: grafana
  template:
    metadata:
      labels:
        app: grafana
    spec:
      containers:
        - image: grafana/grafana
          imagePullPolicy: Always
          name: grafana
          ports:
            - containerPort: 3000
              protocol: TCP
          resources: {}
      dnsPolicy: ClusterFirst
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext: {}
      terminationGracePeriodSeconds: 30

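Once installed, Grafana is ready to use. It needs access to the Prometheus data, so a data source has to be configured under Data Sources:

Dashboards can then be created in Grafana, and PromQL can be used in them directly: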

Exporter

Most of the middleware we use is monitored through stand-alone exporters, such as the RabbitMQ exporter used here:

apiVersion: apps/v1
kind: Deployment
metadata:
  annotations:
    deployment.kubernetes.io/revision: '3'
  labels:
    k8s-app: rabbitmq-exporter
  name: rabbitmq-exporter
  namespace: monitoring
spec:
  progressDeadlineSeconds: 600
  replicas: 1
  revisionHistoryLimit: 2
  selector:
    matchLabels:
      k8s-app: rabbitmq-exporter
  template:
    metadata:
      labels:
        k8s-app: rabbitmq-exporter
    spec:
      containers:
        - env:
            - name: PUBLISH_PORT
              value: '9098'
            - name: RABBIT_CAPABILITIES
              value: 'bert,no_sort'
            - name: RABBIT_USER
              value: xxxx
            - name: RABBIT_PASSWORD
              value: xxxx
            - name: RABBIT_URL
              value: 'http://ip:15672'
          image: kbudde/rabbitmq-exporter
          imagePullPolicy: IfNotPresent
          name: rabbitmq-exporter
          ports:
            - containerPort: 9098
              protocol: TCP

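This starts a rabbitmq-exporter service on port 9098. It queries RabbitMQ's management interface on port 15672, fetches the metric data, and converts it into metrics that Prometheus understands. Monitoring the business itself, however, requires custom instrumentation.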

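Micrometer

Spring Boot itself provides health checks, metrics collection, and monitoring. To expose that data to Prometheus we use Micrometer, which offers a common API for collecting performance data on the Java platform. The application only needs to collect metrics through Micrometer's common API, and Micrometer takes care of adapting to the different monitoring systems.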

Adding the dependency

<dependency>
    <groupId>io.micrometer</groupId>
    <artifactId>micrometer-registry-prometheus</artifactId>
</dependency>

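Once this dependency is added, Spring Boot auto-configures a PrometheusMeterRegistry and a CollectorRegistry to collect and export metric data in a format that Prometheus can scrape.

All of this data is exposed at the Actuator /prometheus endpoint, which Prometheus can scrape periodically to obtain the metrics.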

The prometheus endpoint

After starting the Spring Boot service, open http://ip:8080/actuator/prometheus directly. Spring Boot already provides some common application metrics, such as JVM data:

# HELP tomcat_sessions_created_sessions_total 
# TYPE tomcat_sessions_created_sessions_total counter
tomcat_sessions_created_sessions_total{application="springboot-actuator-prometheus-test",} 1782.0
# HELP tomcat_sessions_active_current_sessions 
# TYPE tomcat_sessions_active_current_sessions gauge
tomcat_sessions_active_current_sessions{application="springboot-actuator-prometheus-test",} 365.0
# HELP jvm_threads_daemon_threads The current number of live daemon threads
# TYPE jvm_threads_daemon_threads gauge
jvm_threads_daemon_threads{application="springboot-actuator-prometheus-test",} 16.0
# HELP process_cpu_usage The "recent cpu usage" for the Java Virtual Machine process
# TYPE process_cpu_usage gauge
process_cpu_usage{application="springboot-actuator-prometheus-test",} 0.0102880658436214
# HELP jvm_gc_memory_allocated_bytes_total Incremented for an increase in the size of the young generation memory pool after one GC to before the next
# TYPE jvm_gc_memory_allocated_bytes_total counter
jvm_gc_memory_allocated_bytes_total{application="springboot-actuator-prometheus-test",} 9.13812704E8
# HELP jvm_buffer_count_buffers An estimate of the number of buffers in the pool
# TYPE jvm_buffer_count_buffers gauge
jvm_buffer_count_buffers{application="springboot-actuator-prometheus-test",id="mapped",} 0.0
jvm_buffer_count_buffers{application="springboot-actuator-prometheus-test",id="direct",} 10.0
...

Configuring the target in Prometheus

Add the following to prometheus.yml:

- job_name: 'springboot-actuator-prometheus-test'
  metrics_path: '/actuator/prometheus'
  scrape_interval: 5s
  basic_auth:
    username: 'actuator'
    password: 'actuator'
  static_configs:
    - targets: ['ip:8080']
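Before relying on the scrape job, it can help to confirm that the endpoint answers with the same path and credentials. The Python sketch below builds the kind of request Prometheus would send, with HTTP Basic authentication; the base URL in the usage note is an assumption, and both helper names are made up.

```python
import base64
from urllib.request import Request, urlopen


def build_scrape_request(base_url, username, password, path="/actuator/prometheus"):
    """Build a GET request for the metrics path with a Basic auth header."""
    token = base64.b64encode(("%s:%s" % (username, password)).encode("utf-8")).decode("ascii")
    request = Request(base_url.rstrip("/") + path)
    request.add_header("Authorization", "Basic " + token)
    return request


def first_metric_lines(base_url, username, password, limit=5):
    """Fetch the endpoint and return the first few non-comment sample lines."""
    with urlopen(build_scrape_request(base_url, username, password), timeout=10) as resp:
        text = resp.read().decode("utf-8")
    samples = [line for line in text.splitlines() if line and not line.startswith("#")]
    return samples[:limit]
```

For instance, `first_metric_lines("http://localhost:8080", "actuator", "actuator")` would return a handful of sample lines if the service is up and the credentials are accepted, and raise an HTTP error otherwise.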

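After adding it, reload the configuration (the reload endpoint is available only when Prometheus is started with --web.enable-lifecycle):

curl -X POST http://ip:9090/-/reload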

Check the Prometheus targets again:

Grafana

A JVM dashboard can be added, as shown below:

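Business instrumentation

Micrometer provides a set of native meters, including Timer, Counter, Gauge, DistributionSummary, and LongTaskTimer. Different meter types produce different numbers of time series. For example, a single value is represented by a Gauge, while the count and total time of timed events are represented by a Timer.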

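Counter: can be incremented by a fixed amount, which must be positive;
Gauge: a handle for getting the current value. Typical examples are the size of a collection or map, or the number of running threads;
Timer: measures short-duration latencies and the frequency of such events. Every Timer implementation reports at least the total time and the event count as separate time series;
LongTaskTimer: tracks the total duration of all long-running tasks that are currently in flight, and the number of such tasks;
DistributionSummary: tracks the distribution of events.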
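To illustrate what "total time and event count as separate time series" means for a Timer, here is a toy model in Python. It is not the Micrometer API (which is a Java library); `SimpleTimer` and the metric name `order_create` are made up, and the output only mimics the `_count`/`_sum` naming seen in `jvm_gc_pause_seconds` earlier.

```python
import time


class SimpleTimer:
    """Toy timer: tracks how many events occurred and their total duration."""

    def __init__(self, name):
        self.name = name
        self.count = 0
        self.total_seconds = 0.0

    def record(self, func, *args, **kwargs):
        """Run func, timing it, and return its result."""
        start = time.perf_counter()
        try:
            return func(*args, **kwargs)
        finally:
            # Recorded even when func raises, so failures are counted too.
            self.count += 1
            self.total_seconds += time.perf_counter() - start

    def expose(self):
        """Render the two series in Prometheus text form."""
        return "%s_seconds_count %d\n%s_seconds_sum %f\n" % (
            self.name, self.count, self.name, self.total_seconds)


timer = SimpleTimer("order_create")
timer.record(sum, [1, 2, 3])
timer.record(sorted, [3, 1, 2])
print(timer.expose())
```

Keeping the count and the sum separate, rather than storing an average, is what lets the monitoring system compute rates and averages over any time window afterwards.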

More: Micrometer

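Summary

This article walked through the whole process of using Prometheus as a monitoring service, from principles to examples, and can serve as an introductory tutorial. The real strength of Prometheus lies in PromQL, which readers can study further as their needs require. In addition, Micrometer's instrumentation API is essentially a wrapper around the Prometheus API (simpleclient) that makes it easier for developers to use, and it too can be learned as needed.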