关于prometheus:n9e告警可高用的实现机制分析

n9e=nightingale

n9e监控告警框架，提供了监控绘图、监控告警以及告诉等一体的监控运维体系，在云原生时代，能够认为是Open-falcon的升级版。

一. 告警的数据流

指标存储：应用push模式

categraf采集后push给n9e-server；
n9e-server将指标值push给时序库prometheus；

指标告警：应用pull模式

由n9e-server应用PromQL向prometheus进行查问；
若查问到后果，示意告警被触发；

二. 告警实例的高可用

告警配置被存储在MySQL中，n9e-server从MySQL中读出，而后查问其中的PromQL，断定告警是否被触发。

1. 问题

相似于prometheus，当部署多个n9e-server实例后，若每个n9e-server都进行下面的告警断定过程，会导致每个n9e-server上都有告警产生。

prometheus是通过alertmanager进行告警去重的，那n9e-server是如何实现的呢？

2. 解决

n9e-server通过heartbeat+一致性hash，对告警规定进行分片，每个n9e-server只负责本人实例的告警规定的断定。

当n9e-server有实例退出时，已有的n9e-server实例会进行rehash，让新节点负责一部分告警规定；
当n9e-server有实例退出时，所有的n9e-server实例会进行rehash，退出节点的告警规定被分给存活的节点；

Heartbeat:

n9e-server实例启动一个heartbeat线程，定期的上报本人的状态并存储在MySQL中；
若发现过程内保留的实例列表与MySQL的实例列表不同，阐明产生了节点变动，则进行rehash；

一致性hash:

对某个datasource，比方prometheus-a1，有N个n9e-server注册进来，则将N个n9e-server组成一致性hash环；
对某个告警规定，应用rule.id进行一致性hash，命中到哪个n9e-server实例，就由其进行告警断定；

三. 源码剖析

实现代码次要在alert/naming/*.go。

在做告警断定时，对每个告警规定：

若rule不归本节点负责，则跳过；

// alert/eval/alert_rule.gofunc (s *Scheduler) syncAlertRules() {    ids := s.alertRuleCache.GetRuleIds()    alertRuleWorkers := make(map[string]*AlertRuleWorker)    externalRuleWorkers := make(map[string]*process.Processor)    for _, id := range ids {        rule := s.alertRuleCache.Get(id)        if rule == nil {            continue        }        if rule.IsPrometheusRule() {            datasourceIds := s.promClients.Hit(rule.DatasourceIdsJson)            for _, dsId := range datasourceIds {                // 断定rule是否归本节点负责                if !naming.DatasourceHashRing.IsHit(dsId, fmt.Sprintf("%d", rule.Id), s.aconf.Heartbeat.Endpoint) {                    continue                }                ...            }        }        ...    }    ...}

1. 一致性hash

一致性hash的key=rule.id，依据key判断该rule是否归本节点负责：

// alert/naming/hashring.gofunc (chr *DatasourceHashRingType) IsHit(datasourceId int64, pk string, currentNode string) bool {    node, err := chr.GetNode(datasourceId, pk)    ...    return node == currentNode}

依据key确定节点的过程：

对每个datasource，都有一个hash环：
先找到datasource对应的hash环；
再在hash环中找负责的节点；

// alert/naming/hashring.gofunc (chr *DatasourceHashRingType) GetNode(datasourceId int64, pk string) (string, error) {    chr.Lock()    defer chr.Unlock()    _, exists := chr.Rings[datasourceId]    if !exists {        ...    }    return chr.Rings[datasourceId].Get(pk)}

hash环的结构在heartbeat中。

2. heartbeat

n9e-server中开启一个goroutine，专门进行heartbeat：

默认1s执行1次；

// alert/naming/heartbeat.gofunc (n *Naming) loopHeartbeat() {    interval := time.Duration(n.heartbeatConfig.Interval) * time.Millisecond    for {        time.Sleep(interval)        if err := n.heartbeat(); err != nil {            logger.Warning(err)        }    }}

heartbeat的过程：

对每个datasource，查找其关联的最新的n9e-server列表=newss；
跟缓存的oldss比照，若有变动，则对server进行rehash，从新结构hash环；

// alert/naming/heartbeat.gofunc (n *Naming) heartbeat() error {    // 在页面上保护实例和集群的对应关系    datasourceIds, err = models.GetDatasourceIdsByEngineName(n.ctx, n.heartbeatConfig.EngineName)    for i := 0; i < len(datasourceIds); i++ {        servers, err := n.ActiveServers(datasourceIds[i])        ...        sort.Strings(servers)        newss := strings.Join(servers, " ")        oldss, exists := localss[datasourceIds[i]]        if exists && oldss == newss {            continue            // 没有变动        }        RebuildConsistentHashRing(datasourceIds[i], servers)    // rehash        localss[datasourceIds[i]] = newss    }    return nil}

查问datasource关联的最新n9e-server的过程：

实际上就查问alerting_engines表，找指标datasource对应的记录，要求30s内有心跳；

// alert/naming/heartbeat.gofunc (n *Naming) ActiveServers(datasourceId int64) ([]string, error) {    ...    // 30秒内有心跳，就认为是活的    return models.AlertingEngineGetsInstances(n.ctx, "datasource_id = ? and clock > ?", datasourceId, time.Now().Unix()-30)}func AlertingEngineGetsInstances(ctx *ctx.Context, where string, args ...interface{}) ([]string, error) {    var arr []string    var err error    session := DB(ctx).Model(new(AlertingEngines)).Order("instance")    if where == "" {        err = session.Pluck("instance", &arr).Error    } else {        err = session.Where(where, args...).Pluck("instance", &arr).Error    }    return arr, err}

mysql> select * from alerting_engines;+----+----------------------+---------------+----------------+------------+| id | instance             | datasource_id | engine_cluster | clock      |+----+----------------------+---------------+----------------+------------+|  2 | 192.168.100.20:17000 |      99999999 | default        | 1690537360 ||  3 | 192.168.100.20:17000 |             1 | default        | 1690537360 ||  4 | 192.168.100.20:17000 |             2 | default        | 1690537360 |+----+----------------------+---------------+----------------+------------+3 rows in set (0.00 sec)

n9e-server实例，在heartbeat的过程中，会将本人注册进去：

// alert/naming/heartbeat.gofunc (n *Naming) heartbeat() error {    ...    for i := 0; i < len(datasourceIds); i++ {        err := models.AlertingEngineHeartbeatWithCluster(n.ctx, n.heartbeatConfig.Endpoint, n.heartbeatConfig.EngineName, datasourceIds[i])        if err != nil {            logger.Warningf("heartbeat with cluster %d err:%v", datasourceIds[i], err)        }    }    ...    }func AlertingEngineHeartbeatWithCluster(ctx *ctx.Context, instance, cluster string, datasourceId int64) error {    ...    var total int64    err := DB(ctx).Model(new(AlertingEngines)).Where("instance=? and engine_cluster = ? and datasource_id=?", instance, cluster, datasourceId).Count(&total).Error    ...    // 没有记录，则insert    if total == 0 {        // insert        err = DB(ctx).Create(&AlertingEngines{            Instance:      instance,            DatasourceId:  datasourceId,            EngineCluster: cluster,            Clock:         time.Now().Unix(),        }).Error    } else { // 否则，update最新工夫        // updates        fields := map[string]interface{}{"clock": time.Now().Unix()}        err = DB(ctx).Model(new(AlertingEngines)).Where("instance=? and engine_cluster = ? and datasource_id=?", instance, cluster, datasourceId).Updates(fields).Error    }    return err}

参考：

https://flashcat.cloud/docs/content/flashcat-monitor/nighting...
https://answer.flashcat.cloud/questions/10010000000002963/100...