This article was first published on 泊浮目's Jianshu: https://www.jianshu.com/u/204...

Version	Date	Notes
1.0	2021.8.14	Initial publication

0. Background

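Recently I was adding monitoring to a stream-processing component and used PushGateway (hereafter PGW). I hit quite a few pitfalls along the way, so I am sharing them here.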

1. Why Push (PGW)

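The previous implementation was pull-based: the process exposes a service port that speaks the Prometheus (hereafter Prom) protocol, and Prom scrapes the data from it.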
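As a rough illustration of the pull model described above, the sketch below exposes a /metrics endpoint in the Prometheus text exposition format using only the Python standard library. The metric name topology_records_in, the handler class, and the port in the usage note are made up for illustration; a real service would normally use a Prometheus client library rather than hand-rolling this.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer


def render_metrics(samples):
    """Render a {name: value} dict as Prometheus text exposition lines."""
    lines = []
    for name, value in sorted(samples.items()):
        lines.append(f"# TYPE {name} gauge")
        lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"


class MetricsHandler(BaseHTTPRequestHandler):
    # Hypothetical metric; a real process would update this as it runs.
    samples = {"topology_records_in": 42}

    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics(self.samples).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port):
    """Block forever, serving /metrics on the given port."""
    HTTPServer(("0.0.0.0", port), MetricsHandler).serve_forever()
```

Calling serve(9250) would occupy port 9250 on that machine for as long as the process lives, which is exactly the resource that has to be allocated per process in the pull model.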

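But this has a problem: ports have to be allocated. Our team previously used a lot of cumbersome machinery for this (distributed locks, multiple copies of state storage, and so on), yet still could not avoid leaked and wasted ports. The topology's high-availability mechanism lets it drift between machines, so the port previously allocated on a given machine becomes useless. We could also track the topology lifecycle, but that is by no means easy: in larger deployments, thousands of topologies are normal, and effectively tracking the lifecycle of thousands of topologies looks like a big topic in its own right.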

A colleague told me that k8s might solve my problem, and I will try to follow up on introducing that stack later.

We only want to implement monitoring, and do not want to deal with everything else around it.

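So we are back to the perennial question: is push better, or pull? My view is that arguing about it without a concrete scenario is pointless. In this scenario, push fits better.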

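Ultimately, push requires the monitored service to know the address of the monitoring system, so that address has to be configured in the monitored service, and the monitored service ends up depending on the monitoring service to some degree. Pull requires the monitoring system to know the addresses of all monitored services, so every time a monitored service is added, the monitoring service needs some way to discover it. For example, Prom supports obtaining targets dynamically from a service discovery system, and Flink supports a port range for determining where the monitored service is listening.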
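To make the push side concrete, here is a minimal sketch of how a monitored process could build a push request for a Pushgateway. It relies on the Pushgateway's documented URL layout, /metrics/job/<job>{/<label>/<value>}, with the metrics in text exposition format as the body. The host pgw.example, the job name, and the helper names are hypothetical, and the sketch assumes label values that are non-empty and contain no "/" (the Pushgateway has a separate encoding for those cases that is not handled here).

```python
from urllib.parse import quote
from urllib.request import Request


def group_url(base_url, job, grouping=None):
    """Build the Pushgateway URL for one grouping key (job plus labels)."""
    parts = [base_url.rstrip("/"), "metrics", "job", quote(job, safe="")]
    for label, value in (grouping or {}).items():
        parts += [quote(label, safe=""), quote(value, safe="")]
    return "/".join(parts)


def build_push(base_url, job, body, grouping=None):
    """Return a PUT request that replaces all metrics in the group."""
    return Request(
        group_url(base_url, job, grouping),
        data=body.encode("utf-8"),
        method="PUT",
        headers={"Content-Type": "text/plain; version=0.0.4"},
    )


# Building the request does not touch the network; sending it would be
# urllib.request.urlopen(req) against a reachable Pushgateway.
req = build_push(
    "http://pgw.example:9091",          # hypothetical PGW address
    "topo-1",                           # hypothetical job name
    "topology_records_in 42\n",         # hypothetical metric
    {"instance": "worker-3"},
)
```

The only thing the monitored process needs to know is the PGW address; no listening port is allocated on its side.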

For the other differences between the push and pull models, see the table below and weigh them against your own scenario:

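Dimension	Push model	Pull model
Service discovery	Faster. An agent can start sending data automatically on startup, so discovery speed is independent of the number of agents.	Slower. New services are found by periodically scanning the address space, so discovery speed depends on the number of agents.
Scalability	Better. Only agents need to be deployed, and agents are generally stateless.	Worse. The monitoring system's workload grows linearly with the number of agents.
Security	Better. Agents do not listen for network connections, which protects them against remote attacks.	Worse. Agents may be exposed to remote access and denial-of-service attacks.
Operational complexity	Better. The agent only needs the polling interval and the monitoring system's address. Firewalls need to allow one-way measurement traffic from agent to collector.	Worse. The monitoring system must be configured with the list of agents to poll, the security credentials for accessing them, and the set of metrics to retrieve. Firewalls need to allow two-way communication between poller and agents.
Latency	Better. Pushes arrive promptly. Many push protocols (such as sFlow) are built on UDP, which gives non-blocking, low-latency transport of measurements.	Worse. Data is less timely.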

The Prom official site also says that PGW is not suitable in the vast majority of cases, with one exception:

Usually, the only valid use case for the Pushgateway is for capturing the outcome of a service-level batch job. A "service-level" batch job is one which is not semantically related to a specific machine or job instance (for example, a batch job that deletes a number of users for an entire service). Such a job's metrics should not include a machine or instance label to decouple the lifecycle of specific machines or instances from the pushed metrics. This decreases the burden for managing stale metrics in the Pushgateway. See also the best practices for monitoring batch jobs.

The best practices for monitoring batch jobs also mention:

There is a fuzzy line between offline-processing and batch jobs, as offline processing may be done in batch jobs. Batch jobs are distinguished by the fact that they do not run continuously, which makes scraping them difficult.

Its GitHub repository likewise describes it this way:

The Prometheus Pushgateway exists to allow ephemeral and batch jobs to expose their metrics to Prometheus. Since these kinds of jobs may not exist long enough to be scraped, they can instead push their metrics to a Pushgateway. The Pushgateway then exposes these metrics to Prometheus.

My business system does have ephemeral jobs (jobs that may not exist long enough to be scraped), which is an important reason I chose PGW.

2. What pitfalls I hit

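After PGW was integrated, all the expected data was there. Then a tester noticed during testing that the monitoring data had suddenly disappeared. I took a look: a large number of topologies had been created to run streaming jobs, PGW had exited outright, and its log contained the words "out of memory".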

After a couple more test runs, I found that PGW consumed more and more memory and CPU as the number of topologies grew.

That reminded me of what the official site says:

The Pushgateway never forgets series pushed to it and will expose them to Prometheus forever unless those series are manually deleted via the Pushgateway's API.

But I had clearly set the deleteOnShutdown option. The official documentation explains it as: "Specifies whether to delete metrics from the PushGateway on shutdown." Yet when I ran PGW again, the related metrics had not been deleted!
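Since the quote above says series stay in PGW until they are deleted through its API, a manual cleanup can be sketched as follows. The Pushgateway accepts an HTTP DELETE on the same grouping-key URL that was used for pushing, which removes every metric in that group. The host pgw.example, the job name, and the helper name are hypothetical, and the grouping key has to match the one the job actually pushed with; the sketch again assumes simple label values without "/".

```python
from urllib.parse import quote
from urllib.request import Request


def build_delete(base_url, job, grouping=None):
    """Return a DELETE request for one Pushgateway group (job plus labels)."""
    parts = [base_url.rstrip("/"), "metrics", "job", quote(job, safe="")]
    for label, value in (grouping or {}).items():
        parts += [quote(label, safe=""), quote(value, safe="")]
    return Request("/".join(parts), method="DELETE")


# Nothing is sent here; urllib.request.urlopen(req) would perform the delete
# against a reachable Pushgateway.
req = build_delete(
    "http://pgw.example:9091",      # hypothetical PGW address
    "topo-1",                       # hypothetical job name
    {"instance": "worker-3"},
)
```

This is the operation one would expect to happen on job shutdown; when it does not, something outside the job has to issue it.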

After some searching, our team found an issue: https://issues.apache.org/jir...

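Some people think PGW should handle TTL itself, while the PGW side thinks that is not a good approach. I think this is something Flink itself should fix. I do not know why it has not been fixed, and the official documentation does not warn about this pitfall either.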
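For readers who want TTL-like behaviour anyway, one option is an external cleaner. The sketch below is not something PGW or Flink provides; it only shows the selection step, assuming the cleaner has already fetched the text from PGW's own /metrics endpoint, where PGW exposes a push_time_seconds series per group. The function name, the TTL value, and the sample text are hypothetical, and the parsing is deliberately simple: it assumes label values contain no "}" character.

```python
import re

# Matches lines such as: push_time_seconds{instance="",job="topo-1"} 1.6289e+09
_LINE = re.compile(r'^push_time_seconds\{(?P<labels>[^}]*)\}\s+(?P<value>\S+)$')
_LABEL = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"')


def stale_groups(metrics_text, now, ttl_seconds):
    """Return the label sets of groups whose last push is older than the TTL."""
    stale = []
    for line in metrics_text.splitlines():
        match = _LINE.match(line.strip())
        if not match:
            continue  # comments and unrelated series are ignored
        labels = dict(_LABEL.findall(match.group("labels")))
        last_push = float(match.group("value"))
        if now - last_push > ttl_seconds:
            stale.append(labels)
    return stale


# Hypothetical sample of what PGW might expose for two groups.
sample = (
    '# TYPE push_time_seconds gauge\n'
    'push_time_seconds{instance="",job="old-topo"} 1000\n'
    'push_time_seconds{instance="",job="new-topo"} 1990\n'
)
expired = stale_groups(sample, now=2000, ttl_seconds=60)
```

Each returned label set identifies a group that a cleaner could then delete through the Pushgateway API. Running this on a schedule trades the "never forgets" guarantee for bounded memory, which is a design choice the PGW project itself has chosen not to make.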

I submitted a patch to the Flink documentation and hope it gets merged soon: https://github.com/apache/fli...

3. Summary

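This article shared the pitfalls our team hit when combining PushGateway with Flink, and discussed why we chose PGW in the first place. Next I plan to look at InfluxDB as the push target in place of PGW. I have also noticed that the ecosystem around newer InfluxDB versions is quite good: it provides dashboards, data visualization, and alerting, so it is no longer just a time-series database. With its own ecosystem, it will look more and more like Prom + Grafana.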

3.1 References

https://blog.sflow.com/2012/0...
https://prometheus.io/docs/pr...
https://prometheus.io/docs/pr...
https://ci.apache.org/project...