需要

收集 ES 的指标, 并进行展现和告警;

现状

  1. ES 通过 docker compose 装置
  2. 所在环境的 K8S 集群有 Prometheus 和 AlertManager 及 Grafana

计划

复用现有的监控体系, 通过: Prometheus 监控 ES.

→: Prometheus 监控 ES 架构

具体实现为:

采集端 elasticsearch_exporter

能够监控的指标为:

NameTypeCardinalityHelp
elasticsearch_breakers_estimated_size_bytesgauge4Estimated size in bytes of breaker
elasticsearch_breakers_limit_size_bytesgauge4Limit size in bytes for breaker
elasticsearch_breakers_trippedcounter4tripped for breaker
elasticsearch_cluster_health_active_primary_shardsgauge1The number of primary shards in your cluster. This is an aggregate total across all indices.
elasticsearch_cluster_health_active_shardsgauge1Aggregate total of all shards across all indices, which includes replica shards.
elasticsearch_cluster_health_delayed_unassigned_shardsgauge1Shards delayed to reduce reallocation overhead
elasticsearch_cluster_health_initializing_shardsgauge1Count of shards that are being freshly created.
elasticsearch_cluster_health_number_of_data_nodesgauge1Number of data nodes in the cluster.
elasticsearch_cluster_health_number_of_in_flight_fetchgauge1The number of ongoing shard info requests.
elasticsearch_cluster_health_number_of_nodesgauge1Number of nodes in the cluster.

...

展现端 基于Grafana

Reference:

ElasticSearch dashboard for Grafana | Grafana Labs

告警指标 基于prometheus alertmanager

Reference:

ElasticSearch:https://awesome-prometheus-al...

施行步骤

以下为手动施行步骤

Docker Compose

docker pull quay.io/prometheuscommunity/elasticsearch-exporter:v1.3.0

docker-compose.yml 示例:

Warning:

exporter 在每次刮削时都会从 ElasticSearch 集群中获取信息,因而过短的刮削距离会给 ES 主节点带来负载,特地是当你应用 --es.all--es.indices 运行时。咱们倡议你测量获取/_nodes/stats/_all/_stats对你的ES集群来说须要多长时间,以确定你的刮削距离是否太短。

原 ES 的 docker-copmose.yml 示例如下:

version: '3'services:  elasticsearch:    image: elasticsearch-plugins:6.8.18    ...    ports:      - 9200:9200      - 9300:9300    restart: always

减少了 elasticsearch_exporter 的yaml如下:

version: '3'services:  elasticsearch:    image: elasticsearch-plugins:6.8.18    ...    ports:      - 9200:9200      - 9300:9300    restart: always  elasticsearch_exporter:      image: quay.io/prometheuscommunity/elasticsearch-exporter:v1.3.0      command:       - '--es.uri=http://elasticsearch:9200'      - '--es.all'      - '--es.indices'      - '--es.indices_settings'      - '--es.indices_mappings'      - '--es.shards'      - '--es.snapshots'      - '--es.timeout=30s'            restart: always      ports:      - "9114:9114"    

Prometheus 配置调整

prometheus 配置

Prometheus 减少动态抓取配置:

scrape_configs:  - job_name: "es"    static_configs:      - targets: ["x.x.x.x:9114"]

阐明:

x.x.x.x 为 ES Exporter IP, 因为 ES Exporter 通过 docker compose 和 ES部署在同一台机器, 所以这个 IP 也是 ES 的IP.

Prometheus Rules

减少 ES 相干的 Prometheus Rules:

groups:  - name: elasticsearch    rules:      - record: elasticsearch_filesystem_data_used_percent        expr: 100 * (elasticsearch_filesystem_data_size_bytes - elasticsearch_filesystem_data_free_bytes)          / elasticsearch_filesystem_data_size_bytes      - record: elasticsearch_filesystem_data_free_percent        expr: 100 - elasticsearch_filesystem_data_used_percent      - alert: ElasticsearchTooFewNodesRunning        expr: elasticsearch_cluster_health_number_of_nodes < 3        for: 0m        labels:          severity: critical        annotations:          description: "Missing node in Elasticsearch cluster\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"          summary: ElasticSearch running on less than 3 nodes(instance {{ $labels.instance }}, node {{$labels.node}})      - alert: ElasticsearchDiskSpaceLow        expr: elasticsearch_filesystem_data_free_percent < 20        for: 2m        labels:          severity: warning        annotations:          summary: Elasticsearch disk space low (instance {{ $labels.instance }}, node {{$labels.node}})          description: "The disk usage is over 80%\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"      - alert: ElasticsearchDiskOutOfSpace        expr: elasticsearch_filesystem_data_free_percent < 10        for: 0m        labels:          severity: critical        annotations:          summary: Elasticsearch disk out of space (instance {{ $labels.instance }}, node {{$labels.node}})          description: "The disk usage is over 90%\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"      - alert: ElasticsearchHeapUsageWarning        expr: (elasticsearch_jvm_memory_used_bytes{area="heap"} / elasticsearch_jvm_memory_max_bytes{area="heap"}) * 100 > 80        for: 2m        labels:          severity: warning        annotations:          summary: Elasticsearch Heap Usage warning (instance {{ $labels.instance }}, node {{$labels.node}})          description: "The heap usage is over 80%\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"      - alert: ElasticsearchHeapUsageTooHigh        expr: (elasticsearch_jvm_memory_used_bytes{area="heap"} / elasticsearch_jvm_memory_max_bytes{area="heap"}) * 100 > 90        for: 2m        labels:          severity: critical        annotations:          summary: Elasticsearch Heap Usage Too High (instance {{ $labels.instance }}, node {{$labels.node}})          description: "The heap usage is over 90%\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"      - alert: ElasticsearchClusterRed        expr: elasticsearch_cluster_health_status{color="red"} == 1        for: 0m        labels:          severity: critical        annotations:          summary: Elasticsearch Cluster Red (instance {{ $labels.instance }}, node {{$labels.node}})          description: "Elastic Cluster Red status\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"      - alert: ElasticsearchClusterYellow        expr: elasticsearch_cluster_health_status{color="yellow"} == 1        for: 0m        labels:          severity: warning        annotations:          summary: Elasticsearch Cluster Yellow (instance {{ $labels.instance }}, node {{$labels.node}})          description: "Elastic Cluster Yellow status\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"      - alert: ElasticsearchHealthyDataNodes        expr: elasticsearch_cluster_health_number_of_data_nodes < 3        for: 0m        labels:          severity: critical        annotations:          summary: Elasticsearch Healthy Data Nodes (instance {{ $labels.instance }}, node {{$labels.node}})          description: "Missing data node in Elasticsearch cluster\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"      - alert: ElasticsearchRelocatingShards        expr: elasticsearch_cluster_health_relocating_shards > 0        for: 0m        labels:          severity: info        annotations:          summary: Elasticsearch relocating shards (instance {{ $labels.instance }}, node {{$labels.node}})          description: "Elasticsearch is relocating shards\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"      - alert: ElasticsearchRelocatingShardsTooLong        expr: elasticsearch_cluster_health_relocating_shards > 0        for: 15m        labels:          severity: warning        annotations:          summary: Elasticsearch relocating shards too long (instance {{ $labels.instance }}, node {{$labels.node}})          description: "Elasticsearch has been relocating shards for 15min\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"      - alert: ElasticsearchInitializingShards        expr: elasticsearch_cluster_health_initializing_shards > 0        for: 0m        labels:          severity: info        annotations:          summary: Elasticsearch initializing shards (instance {{ $labels.instance }}, node {{$labels.node}})          description: "Elasticsearch is initializing shards\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"      - alert: ElasticsearchInitializingShardsTooLong        expr: elasticsearch_cluster_health_initializing_shards > 0        for: 15m        labels:          severity: warning        annotations:          summary: Elasticsearch initializing shards too long (instance {{ $labels.instance }}, node {{$labels.node}})          description: "Elasticsearch has been initializing shards for 15 min\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"      - alert: ElasticsearchUnassignedShards        expr: elasticsearch_cluster_health_unassigned_shards > 0        for: 0m        labels:          severity: critical        annotations:          summary: Elasticsearch unassigned shards (instance {{ $labels.instance }}, node {{$labels.node}})          description: "Elasticsearch has unassigned shards\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"      - alert: ElasticsearchPendingTasks        expr: elasticsearch_cluster_health_number_of_pending_tasks > 0        for: 15m        labels:          severity: warning        annotations:          summary: Elasticsearch pending tasks (instance {{ $labels.instance }}, node {{$labels.node}})          description: "Elasticsearch has pending tasks. Cluster works slowly.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"      - alert: ElasticsearchNoNewDocuments        expr: increase(elasticsearch_indices_docs{es_data_node="true"}[10m]) < 1        for: 0m        labels:          severity: warning        annotations:          summary: Elasticsearch no new documents (instance {{ $labels.instance }}, node {{$labels.node}})          description: "No new documents for 10 min!\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

并重启失效.

Warning:

  • ElasticsearchTooFewNodesRunning 告警的条件是 es 集群的node 少于 3个, 对于单节点 ES 会误报, 所以按需开启rule或按需屏蔽(slience).
  • ElasticsearchHealthyDataNodes 告警同上.

AlertManager 告警规定及收件人配置

按需调整, 示例如下:

'global':  'smtp_smarthost': ''  'smtp_from': ''  'smtp_require_tls': false  'resolve_timeout': '5m''receivers':  - 'name': 'es-email'    'email_configs':      - 'to': 'sfw@example.com,sdfwef@example.com'        'send_resolved': true'route':  'group_by':    - 'job'  'group_interval': '5m'  'group_wait': '30s'  'routes':    - 'receiver': 'es-email'      'match':        'job': 'es'

并重启失效.

Grafana 配置

导入 json 格局的 Grafana Dashboard:

{    "__inputs": [],    "__requires": [        {            "type": "grafana",            "id": "grafana",            "name": "Grafana",            "version": "5.4.0"        },        {            "type": "panel",            "id": "graph",            "name": "Graph",            "version": "5.0.0"        },        {            "type": "datasource",            "id": "prometheus",            "name": "Prometheus",            "version": "5.0.0"        },        {            "type": "panel",            "id": "singlestat",            "name": "Singlestat",            "version": "5.0.0"        }    ],...

️ 参考文档

  • prometheus-community/elasticsearch_exporter: Elasticsearch stats exporter for Prometheus (github.com)
  • ElasticSearch dashboard for Grafana | Grafana Labs
  • Awesome Prometheus alerts | Collection of alerting rules (grep.to)
三人行, 必有我师; 常识共享, 天下为公. 本文由东风微鸣技术博客 EWhisper.cn 编写.