
Automatically Monitoring OpenStack VMs by Auto-Registering node_exporter with Consul and Prometheus

1. The Problem

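At work, the VMs in our OpenStack cluster need monitoring of basic performance metrics. Manually adding node_exporter to every VM as it boots and then editing prometheus.yml is a nightmare for us lazy programmers, so I designed a monitoring scheme based on Prometheus + Consul.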

2. The Solution

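1. Bake node_exporter into the image so that it is deployed on every VM automatically
2. Write a small program that registers node_exporter with Consul automatically, and bake it into the image alongside node_exporter
3. Configure Prometheus to discover node_exporter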

3. Deploying the Consul Cluster

3.1 Cluster layout

OS          Hostname     IP
CentOS 7.7  compute-7-1  172.16.100.71
CentOS 7.7  compute-7-2  172.16.100.72
CentOS 7.7  compute-7-3  172.16.100.73

3.1 Download and install Consul yourself

Consul v1.7.2

3.1.1 Generate the master token

$ curl \
    --request PUT \
    http://172.16.100.71:8500/v1/acl/bootstrap
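
The same bootstrap call can be scripted so that the token is captured rather than copied by hand. The following is a minimal sketch, not part of the original setup: the function names are hypothetical, and it assumes the bootstrap response is a JSON object that carries the token secret in `SecretID` (with the legacy `ID` field as a fallback).

```python
import json
import urllib.request


def build_bootstrap_request(server: str) -> urllib.request.Request:
    """Build the same PUT request that the curl command above sends."""
    return urllib.request.Request(
        url=f"http://{server}/v1/acl/bootstrap",
        method="PUT",
    )


def extract_master_token(response_body: str) -> str:
    """Return the token secret from a bootstrap response body.

    Assumption: the secret is in "SecretID"; "ID" is tried as a fallback.
    """
    payload = json.loads(response_body)
    token = payload.get("SecretID") or payload.get("ID")
    if not token:
        raise ValueError("bootstrap response contains no token")
    return token


def bootstrap(server: str) -> str:
    """Send the bootstrap request and return the master token."""
    request = build_bootstrap_request(server)
    with urllib.request.urlopen(request, timeout=5) as response:
        return extract_master_token(response.read().decode("utf-8"))
```

Calling `bootstrap("172.16.100.71:8500")` once against a running server would return the token that goes into the configuration in the next step.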

3.1.2 Add the master token you obtained to the configuration

compute-7-1:

{
    "bootstrap_expect": 1,
    "datacenter": "sibat_consul",
    "primary_datacenter":"sibat_consul",
    "data_dir": "/data/consul",
    "start_join":[
        "172.16.100.72",
        "172.16.100.73"
    ],
    "retry_join":[
        "172.16.100.72",
        "172.16.100.73"
    ],
    "connect":{"enabled": true},
    "server": true,
    "client_addr": "0.0.0.0",
    "ui": true,
    "node_name": "compute-7-1",
    "bind_addr": "172.16.100.71",
    "advertise_addr": "172.16.100.71",
    "enable_script_checks": false,
    "enable_local_script_checks": true,
    "log_file": "/var/log",
    "log_rotate_bytes": 300000000,
    "log_rotate_duration": "360h",
    "log_level": "info",
    "encrypt": "gEjZMbDxnA5UDS5DJRI3Nn5KvOwdVa46jneHK0gFDa8=",
    "acl": {
        "enabled": true,
        "default_policy": "deny",
        "enable_token_persistence": true,
        "tokens": {"master": "8dc1eb67-1f5f-4e10-ad9d-5e58b047647c"}
    }
}

compute-7-2:

{
    "datacenter": "sibat_consul",
    "primary_datacenter":"sibat_consul",
    "data_dir": "/data/consul",
    "connect":{"enabled": true},
    "server": true,
    "client_addr": "0.0.0.0",
    "ui": true,
    "node_name": "compute-7-2",
    "bind_addr": "172.16.100.72",
    "advertise_addr": "172.16.100.72",
    "enable_script_checks": false,
    "enable_local_script_checks": true,
    "log_file": "/var/log",
    "log_rotate_bytes": 300000000,
    "log_rotate_duration": "360h",
    "log_level": "info",
    "acl_datacenter": "sibat_consul",
    "encrypt": "gEjZMbDxnA5UDS5DJRI3Nn5KvOwdVa46jneHK0gFDa8=",
    "acl": {
        "enabled": true,
        "default_policy": "deny",
        "enable_token_persistence": true,
        "tokens": {"master": "8dc1eb67-1f5f-4e10-ad9d-5e58b047647c"}
    }
}

compute-7-3:

{
    "datacenter": "sibat_consul",
    "primary_datacenter":"sibat_consul",
    "data_dir": "/data/consul",
    "connect":{"enabled": true},
    "server": true,
    "client_addr": "0.0.0.0",
    "ui": true,
    "node_name": "compute-7-3",
    "bind_addr": "172.16.100.73",
    "advertise_addr": "172.16.100.73",
    "enable_script_checks": false,
    "enable_local_script_checks": true,
    "log_file": "/var/log",
    "log_rotate_bytes": 300000000,
    "log_rotate_duration": "360h",
    "log_level": "info",
    "acl_datacenter": "sibat_consul",
    "encrypt": "gEjZMbDxnA5UDS5DJRI3Nn5KvOwdVa46jneHK0gFDa8=",
    "acl": {
        "enabled": true,
        "default_policy": "deny",
        "enable_token_persistence": true,
        "tokens": {"master": "8dc1eb67-1f5f-4e10-ad9d-5e58b047647c"}
    }
}
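
The three files differ only in the node name, the addresses, and the join settings on compute-7-1, so they can also be generated from one template. This is a rough sketch under that assumption; `build_config` and `SERVERS` are hypothetical names, and the token and gossip key are passed in rather than hard-coded.

```python
import json

# Hostname -> IP, taken from the cluster layout table.
SERVERS = {
    "compute-7-1": "172.16.100.71",
    "compute-7-2": "172.16.100.72",
    "compute-7-3": "172.16.100.73",
}


def build_config(node_name, master_token, encrypt_key, bootstrap_node="compute-7-1"):
    """Return the Consul configuration for one node as a dict."""
    ip = SERVERS[node_name]
    config = {
        "datacenter": "sibat_consul",
        "primary_datacenter": "sibat_consul",
        "data_dir": "/data/consul",
        "connect": {"enabled": True},
        "server": True,
        "client_addr": "0.0.0.0",
        "ui": True,
        "node_name": node_name,
        "bind_addr": ip,
        "advertise_addr": ip,
        "enable_script_checks": False,
        "enable_local_script_checks": True,
        "log_file": "/var/log",
        "log_rotate_bytes": 300000000,
        "log_rotate_duration": "360h",
        "log_level": "info",
        "encrypt": encrypt_key,
        "acl": {
            "enabled": True,
            "default_policy": "deny",
            "enable_token_persistence": True,
            "tokens": {"master": master_token},
        },
    }
    if node_name == bootstrap_node:
        # Only the first node bootstraps and joins the other two.
        peers = [addr for name, addr in SERVERS.items() if name != node_name]
        config.update({"bootstrap_expect": 1, "start_join": peers, "retry_join": peers})
    else:
        config["acl_datacenter"] = "sibat_consul"
    return config


def render(node_name, master_token, encrypt_key):
    """Serialize one node's configuration as indented JSON."""
    return json.dumps(build_config(node_name, master_token, encrypt_key), indent=4)
```

For example, `render("compute-7-2", "<master token>", "<gossip key>")` produces the text to write to that node's configuration file.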

Start Consul on all three nodes.

3.1.3 Run on all three nodes

$ sudo useradd consul
$ sudo vim /usr/lib/systemd/system/consul.service
[Unit]
Description=Consul agent
Documentation=https://www.consul.io/docs/

[Service]
User=consul
Group=consul
ExecStart=/usr/bin/consul agent -config-file /etc/consul.d/consul_config.json
KillMode=process
Restart=on-failure
LimitNOFILE=65536

[Install]
WantedBy=multi-user.target
$ sudo systemctl daemon-reload

3.1.4 Run on compute-7-2 and compute-7-3

$ sudo systemctl restart consul && sudo systemctl enable consul

3.1.5 Run on compute-7-1

$ sudo systemctl restart consul && sudo systemctl enable consul

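After startup, the server logs show permission-related errors. According to the official documentation, this happens because no agent token is configured, so an agent token has to be created as well: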

$ curl \
    --request PUT \
    --header "X-Consul-Token: 8dc1eb67-1f5f-4e10-ad9d-5e58b047647c" \
    --data \
    '{"Name":"Agent Token","Type":"client","Rules":"node \"\" { policy = \"write\"} service \"\" {policy = \"read\"}"}' \
    http://172.16.100.71:8500/v1/acl/create
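
The escaped quotes in the `Rules` string are easy to get wrong on the command line. As a small sketch (the function name is hypothetical), the same request can be built in Python, letting `json.dumps` handle the escaping:

```python
import json
import urllib.request

# The rule text from the curl command above, written without shell escaping.
AGENT_RULES = 'node "" { policy = "write"} service "" {policy = "read"}'


def build_agent_token_request(server: str, master_token: str) -> urllib.request.Request:
    """Build the PUT request that creates the agent token."""
    body = json.dumps({"Name": "Agent Token", "Type": "client", "Rules": AGENT_RULES})
    return urllib.request.Request(
        url=f"http://{server}/v1/acl/create",
        data=body.encode("utf-8"),
        headers={"X-Consul-Token": master_token},
        method="PUT",
    )
```

Passing the returned request to `urllib.request.urlopen` sends it; the response body contains the new token.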

3.1.6 Add the agent token you obtained to the configuration

compute-7-1:

{
    "bootstrap_expect": 1,
    "datacenter": "sibat_consul",
    "primary_datacenter":"sibat_consul",
    "data_dir": "/data/consul",
    "start_join":[
        "172.16.100.72",
        "172.16.100.73"
    ],
    "retry_join":[
        "172.16.100.72",
        "172.16.100.73"
    ],
    "connect":{"enabled": true},
    "server": true,
    "client_addr": "0.0.0.0",
    "ui": true,
    "node_name": "compute-7-1",
    "bind_addr": "172.16.100.71",
    "advertise_addr": "172.16.100.71",
    "enable_script_checks": false,
    "enable_local_script_checks": true,
    "log_file": "/var/log",
    "log_rotate_bytes": 300000000,
    "log_rotate_duration": "360h",
    "log_level": "info",
    "encrypt": "gEjZMbDxnA5UDS5DJRI3Nn5KvOwdVa46jneHK0gFDa8=",
    "acl": {
        "enabled": true,
        "default_policy": "deny",
        "enable_token_persistence": true,
        "tokens": {
            "master": "8dc1eb67-1f5f-4e10-ad9d-5e58b047647c",
            "agent": "883efc94-0c59-c46f-67cf-4644ac4adad2"
        }
    }
}

compute-7-2:

{
    "datacenter": "sibat_consul",
    "primary_datacenter":"sibat_consul",
    "data_dir": "/data/consul",
    "connect":{"enabled": true},
    "server": true,
    "client_addr": "0.0.0.0",
    "ui": true,
    "node_name": "compute-7-2",
    "bind_addr": "172.16.100.72",
    "advertise_addr": "172.16.100.72",
    "enable_script_checks": false,
    "enable_local_script_checks": true,
    "log_file": "/var/log",
    "log_rotate_bytes": 300000000,
    "log_rotate_duration": "360h",
    "log_level": "info",
    "acl_datacenter": "sibat_consul",
    "encrypt": "gEjZMbDxnA5UDS5DJRI3Nn5KvOwdVa46jneHK0gFDa8=",
    "acl": {
        "enabled": true,
        "default_policy": "deny",
        "enable_token_persistence": true,
        "tokens": {
            "master": "8dc1eb67-1f5f-4e10-ad9d-5e58b047647c",
            "agent": "883efc94-0c59-c46f-67cf-4644ac4adad2"
        }
    }
}

compute-7-3:

{
    "datacenter": "sibat_consul",
    "primary_datacenter":"sibat_consul",
    "data_dir": "/data/consul",
    "connect":{"enabled": true},
    "server": true,
    "client_addr": "0.0.0.0",
    "ui": true,
    "node_name": "compute-7-3",
    "bind_addr": "172.16.100.73",
    "advertise_addr": "172.16.100.73",
    "enable_script_checks": false,
    "enable_local_script_checks": true,
    "log_file": "/var/log",
    "log_rotate_bytes": 300000000,
    "log_rotate_duration": "360h",
    "log_level": "info",
    "acl_datacenter": "sibat_consul",
    "encrypt": "gEjZMbDxnA5UDS5DJRI3Nn5KvOwdVa46jneHK0gFDa8=",
    "acl": {
        "enabled": true,
        "default_policy": "deny",
        "enable_token_persistence": true,
        "tokens": {
            "master": "8dc1eb67-1f5f-4e10-ad9d-5e58b047647c",
            "agent": "883efc94-0c59-c46f-67cf-4644ac4adad2"
        }
    }
}

3.1.7 Run on compute-7-2 and compute-7-3

$ sudo systemctl restart consul && sudo systemctl enable consul

3.1.8 Run on compute-7-1

$ sudo systemctl restart consul && sudo systemctl enable consul

Once the cluster has stabilized, the UI is available at http://172.16.100.71:8500

4. Integrating Prometheus

$ sudo vim /etc/prometheus/prometheus.yml
...
  - job_name: 'OpenStack-vms'
    consul_sd_configs:
      - server: "172.16.100.71:8500"
        token: '8dc1eb67-1f5f-4e10-ad9d-5e58b047647c'
        services: []
      - server: "172.16.100.72:8500"
        token: '8dc1eb67-1f5f-4e10-ad9d-5e58b047647c'
        services: []
      - server: "172.16.100.73:8500"
        token: '8dc1eb67-1f5f-4e10-ad9d-5e58b047647c'
        services: []
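    relabel_configs:
      # Keep only targets whose Consul tags contain "OpenStack-vms";
      # the tags registered for the service must match this regex.
      # With action: keep, replacement and target_label have no effect.
      - source_labels: [__meta_consul_tags]
        regex: ".*OpenStack-vms.*"
        replacement: OpenStack-vms
        action: keep
        target_label: env
      # Copy each Consul service metadata key to a label of the same name.
      - regex: __meta_consul_service_metadata_(.+)
        action: labelmap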
...
$ sudo systemctl restart prometheus
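
To see which discovered services survive the two relabel rules, the logic can be approximated in a few lines of Python. This is only an illustrative model, not Prometheus code: it assumes the default tag separator `,`, relies on Prometheus anchoring relabel regexes at both ends, and the `project` metadata key in the usage note is a made-up example.

```python
import re

KEEP_REGEX = re.compile(r".*OpenStack-vms.*")
LABELMAP_REGEX = re.compile(r"__meta_consul_service_metadata_(.+)")


def consul_tags_label(tags, separator=","):
    """Build __meta_consul_tags: the tags joined and wrapped by the separator."""
    return separator + separator.join(tags) + separator


def relabel(labels):
    """Apply the keep rule and then the labelmap rule.

    Returns the resulting labels, or None when the target is dropped.
    """
    if not KEEP_REGEX.fullmatch(labels.get("__meta_consul_tags", "")):
        return None
    result = dict(labels)
    for name, value in labels.items():
        match = LABELMAP_REGEX.fullmatch(name)
        if match:
            # labelmap copies the value to a label named after the captured group.
            result[match.group(1)] = value
    return result
```

A target with tags `["OpenStack-vms", "node-exporter"]` and metadata `project=demo` is kept and gains a `project="demo"` label, while a target tagged only `node-exporter` is dropped.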

After the restart, the job_name configured above appears in the Prometheus UI.

5. Automatic VM Registration

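Problem: none of the stock components offers a clean way to register automatically. At first I used a shell script in rc.local that called curl to register the service, but some VMs still ended up unregistered. I also found that Consul does not behave as a strongly consistent registry in this setup: the same service ID was sometimes registered on different nodes at the same time.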

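So I wrote a small program in Go that registers node_exporter automatically. It is started at boot by systemd, and it uses an algorithm that avoids duplicate registrations and spreads registrations evenly across the Consul nodes.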

$ wget https://github.com/FrankenFuncc/consul-registy-service/releases/download/202006161758/consulR.zip
$ unzip consulR.zip
$ wget https://github.com/prometheus/node_exporter/releases/download/v1.0.0/node_exporter-1.0.0.linux-amd64.tar.gz
$ tar -zxvf node_exporter-1.0.0.linux-amd64.tar.gz -C /usr/local/
$ mv /usr/local/node_exporter-1.0.0.linux-amd64 /usr/local/node_exporter
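
The algorithm inside the registration program is not described in this post. As a stand-in illustration of the idea (not the program's actual implementation), one simple way to avoid duplicates is to hash the service ID to a single Consul server, so that a given VM always registers through the same agent while different VMs spread across all three. The ID scheme and function names below are hypothetical; the check intervals mirror the configuration file shown later.

```python
import hashlib
import json
import urllib.request

CONSUL_SERVERS = ["172.16.100.71:8500", "172.16.100.72:8500", "172.16.100.73:8500"]


def pick_server(service_id, servers):
    """Map a service ID to one server; the same ID always yields the same server."""
    digest = hashlib.sha256(service_id.encode("utf-8")).digest()
    ordered = sorted(servers)  # make the choice independent of list order
    return ordered[int.from_bytes(digest[:8], "big") % len(ordered)]


def build_registration(vm_ip, port=9100, tags=("node-exporter",)):
    """Build the JSON body for Consul's /v1/agent/service/register endpoint."""
    return {
        "ID": f"node-exporter-{vm_ip}",  # hypothetical scheme: one ID per VM address
        "Name": "node-exporter",
        "Tags": list(tags),
        "Address": vm_ip,
        "Port": port,
        "Check": {
            "HTTP": f"http://{vm_ip}:{port}/metrics",
            "Interval": "5s",
            "Timeout": "5s",
            "DeregisterCriticalServiceAfter": "5s",
        },
    }


def build_register_request(vm_ip, token, servers=CONSUL_SERVERS):
    """Build the PUT request that registers this VM's node_exporter."""
    body = build_registration(vm_ip)
    server = pick_server(body["ID"], servers)
    return urllib.request.Request(
        url=f"http://{server}/v1/agent/service/register",
        data=json.dumps(body).encode("utf-8"),
        headers={"X-Consul-Token": token},
        method="PUT",
    )
```

Because the server is derived from the service ID, repeating the registration after a reboot goes to the same agent and overwrites the earlier entry instead of creating a second one elsewhere.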

Installing node_exporter and enabling it at boot

$ vim 
[Unit]
Description=node_exporter: the monitoring system
Documentation=http://prometheus.io/docs/

[Service]
ExecStart=/usr/local/node_exporter/node_exporter
Restart=always
StartLimitInterval=0
RestartSec=10

[Install]
WantedBy=multi-user.target
$ systemctl daemon-reload && systemctl start node_exporter && systemctl enable node_exporter

Installing the Consul registration program and enabling it at boot

$ vim /etc/consul/consul.yaml
System:
  ServiceName: consul-registy-service
  ListenAddress: 0.0.0.0
  Port: 9984
  # This IP and port are used to determine the IP address of the outbound network interface
  FindAddress: 8.8.8.8:80
Logs:
  LogFilePath: /data/consul/consul.log
  LogLevel: info
Consul:
  Address: 172.16.100.71:8500,172.16.100.72:8500,172.16.100.73:8500
  Token: 8dc1eb67-1f5f-4e10-ad9d-5e58b047647c
  CheckTimeout: 5s
  CheckInterval: 5s
  CheckDeregisterCriticalServiceAfter: true
  CheckDeregisterCriticalServiceAfterTime: 5s
Service:
  Tag: node-exporter
  # If Address is empty, the outbound interface's IP address is determined via FindAddress
  Address:
  Port: 9100
$ vim /usr/lib/systemd/system/consul.service 
[Unit]
Description=Consul
After=network-online.target

[Service]
User=nobody
ExecStart=/usr/local/consul --confpath=/etc/consul/consul.yaml
Restart=on-failure
RestartSec=1

[Install]
WantedBy=multi-user.target
$ systemctl daemon-reload && systemctl start consul && systemctl enable consul
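
The `FindAddress` setting suggests a common trick for finding the outbound interface's address: open a UDP socket "connected" to a remote address and read back the local address the kernel chose. Whether the registration program does exactly this is an assumption; the sketch below only illustrates the technique, and `outbound_ip` is a hypothetical name.

```python
import socket


def outbound_ip(find_address: str = "8.8.8.8:80") -> str:
    """Return the local IP the kernel would use to reach find_address.

    Connecting a UDP socket sends no packets; it only selects a route,
    so the remote host does not need to answer.
    """
    host, port = find_address.rsplit(":", 1)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.connect((host, int(port)))
        return sock.getsockname()[0]
```

On a VM with a single NIC, `outbound_ip()` returns that NIC's address, which is the value that would be registered when `Service.Address` is left empty.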

Once the image has been built, any VM launched from it is discovered by Prometheus automatically.
