Prometheus Operator Tutorial: Sharding Prometheus by Service Dimension

Original link: https://fuckcloudnative.io/posts/aggregate-metrics-user-prometheus-operator/

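Prometheus itself only supports single-node deployment. It has no built-in clustering and supports neither high availability nor horizontal scaling, and its storage is limited by the capacity of the local disk. As the volume of collected data grows, the number of time series a single Prometheus instance can handle hits a ceiling. CPU and memory usage both rise, and memory usually becomes the bottleneck first. The main reasons are:

Prometheus flushes a block of data to disk every 2 hours. Until then all of that data lives in memory, so memory usage grows with the volume of data collected.
Loading historical data reads it from disk into memory, so the wider the query range, the more memory is used. There is some room for optimization here.
Unreasonable queries also increase memory usage, for example Group operations or Rate over a large range.

At that point you either add memory or shard the deployment so that each instance has fewer metrics to collect. This article looks at how to split a Prometheus deployed through Prometheus Operator into several instances by service dimension.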

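1. Splitting Prometheus by service dimension

Prometheus advocates splitting by function or service. If there are many services to scrape, each Prometheus instance is configured to collect and store the metrics of only one service, or one subset of services. Splitting Prometheus into multiple instances this way, each scraping its own set of services, also achieves a degree of horizontal scaling.

In a Kubernetes cluster, we can split Prometheus instances by namespace. For example, all monitoring related to the Kubernetes cluster components goes to one Prometheus instance, and everything else goes to another.

Prometheus Operator controls the deployment of Prometheus instances through a CRD named Prometheus. In that resource, the serviceMonitorNamespaceSelector and podMonitorNamespaceSelector fields take labels that restrict which namespaces targets are scraped from. For example, label the kube-system namespace monitoring-role=system and label the other namespaces monitoring-role=others.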
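To make the selector behaviour concrete, here is a minimal Python sketch of how a matchLabels selector picks namespaces. It is not Prometheus Operator code. The function names, the sample namespace table, and the unlabeled namespace staging are all hypothetical. It only models the rule that a namespace matches when it carries every key/value pair listed in the selector.

```python
from typing import Dict, List

Labels = Dict[str, str]


def matches(match_labels: Labels, labels: Labels) -> bool:
    """Return True when `labels` contains every key/value pair in `match_labels`."""
    return all(labels.get(key) == value for key, value in match_labels.items())


def assign_namespaces(
    instances: Dict[str, Labels], namespaces: Dict[str, Labels]
) -> Dict[str, List[str]]:
    """Map each Prometheus instance name to the namespaces its selector matches."""
    return {
        name: sorted(ns for ns, labels in namespaces.items() if matches(selector, labels))
        for name, selector in instances.items()
    }


# Selectors as they would appear under matchLabels for the two instances.
instances = {
    "system": {"monitoring-role": "system"},
    "others": {"monitoring-role": "others"},
}

# Hypothetical cluster state; "staging" was never labeled.
namespaces = {
    "kube-system": {"monitoring-role": "system"},
    "monitoring": {"monitoring-role": "others"},
    "default": {"monitoring-role": "others"},
    "staging": {},
}

assignment = assign_namespaces(instances, namespaces)
print(assignment)
```

In this model a namespace without the label, like staging, is matched by neither selector, so neither instance would pick up its ServiceMonitors and PodMonitors.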

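2. Splitting the alerting rules

Once Prometheus is split into multiple instances, the default alerting rules can no longer be used. They are written against the metrics of all targets, and since no single instance can see every target's metrics, they would fire constantly. To solve this, the alerting rules must be split so that they correspond one-to-one with each instance's service dimension. Following the split described above, only two sets of alerting rules are needed here. Give them different labels, then select the matching rules in the Prometheus CRD resource by specifying those labels in the ruleSelector field.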

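3. Centralized data storage

With alerting sorted out, one problem remains: the monitoring data is now scattered. Querying it in Grafana requires adding many data sources, data from different sources cannot be aggregated in a single query, and dashboards offer no global view, which makes querying messy.

To solve this, we can relieve Prometheus of long-term storage and have it send the collected samples to a remote storage adapter via Remote Write. Pointing the Grafana data source at the remote storage address then gives a global view in Grafana. Here we choose VictoriaMetrics as the remote storage. VictoriaMetrics is a high-performance, low-cost, scalable time series database that can serve as long-term storage for Prometheus. It comes in a single-node version and a cluster version, both open source. If the ingestion rate is below one million data points per second, the official documentation recommends the single-node version over the cluster version. This article uses only the single-node version for demonstration; the architecture is shown in the accompanying diagram.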
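To compare a setup against the one-million-data-points-per-second figure, a rough estimate is usually enough, because each active time series produces one sample per scrape interval. The sketch below does that arithmetic. The series counts and scrape intervals are made-up illustration values, not measurements from the author's cluster, and the function names are hypothetical.

```python
SINGLE_NODE_LIMIT = 1_000_000  # data points per second, the threshold quoted above


def ingestion_rate(active_series: int, scrape_interval_seconds: float) -> float:
    """Approximate samples per second: one sample per series per scrape."""
    if scrape_interval_seconds <= 0:
        raise ValueError("scrape interval must be positive")
    return active_series / scrape_interval_seconds


def single_node_is_enough(total_rate: float) -> bool:
    """Apply the rule of thumb: stay on single-node below the threshold."""
    return total_rate < SINGLE_NODE_LIMIT


# Hypothetical numbers for the two Prometheus instances.
system_rate = ingestion_rate(active_series=600_000, scrape_interval_seconds=30)
others_rate = ingestion_rate(active_series=150_000, scrape_interval_seconds=15)
total = system_rate + others_rate
print(total, single_node_is_enough(total))
```

Both instances write to the same remote storage, so it is their combined rate that should be compared with the threshold.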

4. Hands-on

With the plan settled, let's put it into practice.

Deploying VictoriaMetrics

First deploy a single-instance VictoriaMetrics. The complete YAML is as follows:

kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: victoriametrics
  namespace: kube-system
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 100Gi
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  labels:
    app: victoriametrics
  name: victoriametrics
  namespace: kube-system
spec:
  # Must match the name of the Service defined below
  serviceName: victoriametrics
  selector:
    matchLabels:
      app: victoriametrics
  replicas: 1
  template:
    metadata:
      labels:
        app: victoriametrics
    spec:
      # Node label specific to the author's cluster; adjust or remove it for yours
      nodeSelector:
        blog: "true"
      containers:
      - args:
        - --storageDataPath=/storage
        - --httpListenAddr=:8428
        - --retentionPeriod=1
        image: victoriametrics/victoria-metrics
        imagePullPolicy: IfNotPresent
        name: victoriametrics
        ports:
        - containerPort: 8428
          protocol: TCP
        readinessProbe:
          httpGet:
            path: /health
            port: 8428
          initialDelaySeconds: 30
          timeoutSeconds: 30
        livenessProbe:
          httpGet:
            path: /health
            port: 8428
          initialDelaySeconds: 120
          timeoutSeconds: 30
        resources:
          limits:
            cpu: 2000m
            memory: 2000Mi
          requests:
            cpu: 2000m
            memory: 2000Mi
        volumeMounts:
        - mountPath: /storage
          name: storage-volume
      restartPolicy: Always
      priorityClassName: system-cluster-critical
      volumes:
      - name: storage-volume
        persistentVolumeClaim:
          claimName: victoriametrics
---
apiVersion: v1
kind: Service
metadata:
  labels:
    app: victoriametrics
  name: victoriametrics
  namespace: kube-system
spec:
  ports:
  - name: http
    port: 8428
    protocol: TCP
    targetPort: 8428
  selector:
    app: victoriametrics
  type: ClusterIP

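A few startup flags deserve attention:

storageDataPath: path to the data directory. VictoriaMetrics stores all of its data in this directory.
retentionPeriod: data retention period, in months. Older data is deleted automatically. The default is 1 month.
httpListenAddr: TCP address to listen on for HTTP requests. By default it listens on port 8428 on all network interfaces.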

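Labeling the namespaces

To restrict which namespaces targets are scraped from, we need to label the namespaces so that each Prometheus instance scrapes only the metrics of specific namespaces. Following the plan above, label kube-system with monitoring-role=system: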

$ kubectl label ns kube-system monitoring-role=system

Label the other namespaces with monitoring-role=others. For example:

$ kubectl label ns monitoring monitoring-role=others
$ kubectl label ns default monitoring-role=others
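Because each instance only selects namespaces that carry a matching monitoring-role label, it can help to check that no namespace was missed. The following is a minimal sketch, assuming the input is the JSON document produced by kubectl get ns -o json (a list object whose items each have metadata.name and an optional metadata.labels). The function name and the sample data are hypothetical.

```python
import json
from typing import List

ROLE_LABEL = "monitoring-role"
KNOWN_ROLES = {"system", "others"}


def namespaces_missing_role(namespace_list: dict) -> List[str]:
    """Return names of namespaces whose monitoring-role label is absent or unknown."""
    missing = []
    for item in namespace_list.get("items", []):
        metadata = item.get("metadata", {})
        labels = metadata.get("labels") or {}  # the labels key may be absent
        if labels.get(ROLE_LABEL) not in KNOWN_ROLES:
            missing.append(metadata.get("name", "<unnamed>"))
    return sorted(missing)


# Hypothetical, trimmed-down output of `kubectl get ns -o json`.
sample = json.loads("""
{
  "kind": "List",
  "items": [
    {"metadata": {"name": "kube-system", "labels": {"monitoring-role": "system"}}},
    {"metadata": {"name": "default", "labels": {"monitoring-role": "others"}}},
    {"metadata": {"name": "kube-public"}}
  ]
}
""")

print(namespaces_missing_role(sample))
```

Any namespace this reports would need a kubectl label command like the ones above before its targets are scraped by either instance.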

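Splitting the PrometheusRule

The alerting rules need to be split into two PrometheusRule objects according to what they monitor. Concretely, gather the rules related to the kube-system namespace into one PrometheusRule and change its name and labels: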

# prometheus-rules-system.yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  labels:
    prometheus: system
    role: alert-rules
  name: prometheus-system-rules
  namespace: monitoring
spec:
  groups:
......

Put the remaining rules into a second PrometheusRule, again changing the name and labels:

# prometheus-rules-others.yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  labels:
    prometheus: others
    role: alert-rules
  name: prometheus-others-rules
  namespace: monitoring
spec:
  groups:
......

Then delete the default PrometheusRule:

$ kubectl -n monitoring delete prometheusrule prometheus-k8s-rules

Create the two new PrometheusRule objects:

$ kubectl apply -f prometheus-rules-system.yaml
$ kubectl apply -f prometheus-rules-others.yaml

If you really don't know how to split the rules, or would simply rather grab a ready-made version, see these files:

prometheus-rules-system.yaml
prometheus-rules-others.yaml

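Splitting Prometheus

The next step is to split the Prometheus instance. Following the plan above, we need two instances: one to monitor the kube-system namespace, and another to monitor the other namespaces: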

# prometheus-prometheus.yaml
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  labels:
    prometheus: system
  name: system
  namespace: monitoring
spec:
  remoteWrite:
    - url: http://victoriametrics.kube-system.svc.cluster.local:8428/api/v1/write
      queueConfig:
        maxSamplesPerSend: 10000
  retention: 2h
  alerting:
    alertmanagers:
    - name: alertmanager-main
      namespace: monitoring
      port: web
  image: quay.io/prometheus/prometheus:v2.17.2
  nodeSelector:
    beta.kubernetes.io/os: linux
  podMonitorNamespaceSelector:
    matchLabels:
      monitoring-role: system
  podMonitorSelector: {}
  replicas: 1
  resources:
    requests:
      memory: 400Mi
    limits:
      memory: 2Gi
  ruleSelector:
    matchLabels:
      prometheus: system
      role: alert-rules
  securityContext:
    fsGroup: 2000
    runAsNonRoot: true
    runAsUser: 1000
  serviceAccountName: prometheus-k8s
  serviceMonitorNamespaceSelector:
    matchLabels:
      monitoring-role: system
  serviceMonitorSelector: {}
  version: v2.17.2
---
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  labels:
    prometheus: others
  name: others
  namespace: monitoring
spec:
  remoteWrite:
    - url: http://victoriametrics.kube-system.svc.cluster.local:8428/api/v1/write
      queueConfig:
        maxSamplesPerSend: 10000
  retention: 2h
  alerting:
    alertmanagers:
    - name: alertmanager-main
      namespace: monitoring
      port: web
  image: quay.io/prometheus/prometheus:v2.17.2
  nodeSelector:
    beta.kubernetes.io/os: linux
  podMonitorNamespaceSelector:
    matchLabels:
      monitoring-role: others
  podMonitorSelector: {}
  replicas: 1
  resources:
    requests:
      memory: 400Mi
    limits:
      memory: 2Gi
  ruleSelector:
    matchLabels:
      prometheus: others
      role: alert-rules
  securityContext:
    fsGroup: 2000
    runAsNonRoot: true
    runAsUser: 1000
  serviceAccountName: prometheus-k8s
  serviceMonitorNamespaceSelector:
    matchLabels:
      monitoring-role: others
  serviceMonitorSelector: {}
  additionalScrapeConfigs:
    name: additional-scrape-configs
    key: prometheus-additional.yaml
  version: v2.17.2

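Configuration points to note:

remoteWrite specifies the remote storage that remote write sends data to.
ruleSelector specifies which PrometheusRule objects to load.
Memory usage is capped at 2Gi; adjust this to suit your environment.
retention sets how long data is kept on local disk, here 2 hours. With remote storage configured there is no need to keep data locally for long, so keep it as short as practical.
Custom Prometheus scrape configuration is supplied through additionalScrapeConfigs on the others instance. You can of course split further and move it to another instance.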

Delete the default Prometheus instance:

$ kubectl -n monitoring delete prometheus k8s

Create the new Prometheus instances:

$ kubectl apply -f prometheus-prometheus.yaml

Check that they are running:

$ kubectl -n monitoring get prometheus
NAME     VERSION   REPLICAS   AGE
system   v2.17.2   1          29h
others   v2.17.2   1          29h
$ kubectl -n monitoring get sts
NAME                READY   AGE
prometheus-system   1/1     29h
prometheus-others   1/1     29h
alertmanager-main   1/1     25d

Check the memory usage of each Prometheus instance:

$ kubectl -n monitoring top pod -l app=prometheus
NAME                  CPU(cores)   MEMORY(bytes)
prometheus-others-0   12m          110Mi
prometheus-system-0   121m         1182Mi

Finally, the Prometheus Service also needs to be changed. The YAML is as follows:

apiVersion: v1
kind: Service
metadata:
  labels:
    prometheus: system
  name: prometheus-system
  namespace: monitoring
spec:
  ports:
  - name: web
    port: 9090
    targetPort: web
  selector:
    app: prometheus
    prometheus: system
  sessionAffinity: ClientIP
---
apiVersion: v1
kind: Service
metadata:
  labels:
    prometheus: others
  name: prometheus-others
  namespace: monitoring
spec:
  ports:
  - name: web
    port: 9090
    targetPort: web
  selector:
    app: prometheus
    prometheus: others
  sessionAffinity: ClientIP

Delete the default Service:

$ kubectl -n monitoring delete svc prometheus-k8s

Create the new Services:

$ kubectl apply -f prometheus-service.yaml

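Changing the Grafana data source

Once Prometheus has been split successfully, the last step is to point the Grafana data source at the VictoriaMetrics address. This gives a global view in Grafana and makes aggregated queries possible.

Open the Grafana settings page and change the data source to http://victoriametrics.kube-system.svc.cluster.local:8428:

Click the Explore menu:

Enter up in the query box, then press Shift+Enter to run the query:

The query results include all namespaces.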
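The same check can be scripted against the Prometheus-compatible query API that VictoriaMetrics serves at /api/v1/query. The sketch below only builds the request URL and summarises a response in the standard Prometheus instant-vector format. The helper names and the sample response are hypothetical, and actually sending the request (for example with urllib.request.urlopen from inside the cluster) is left out.

```python
from typing import Dict
from urllib.parse import urlencode

BASE_URL = "http://victoriametrics.kube-system.svc.cluster.local:8428"


def build_query_url(base_url: str, query: str) -> str:
    """Build an instant-query URL for a Prometheus-compatible API."""
    params = urlencode({"query": query})
    return f"{base_url.rstrip('/')}/api/v1/query?{params}"


def targets_per_namespace(response: dict) -> Dict[str, int]:
    """Count series in an instant-vector response, grouped by the namespace label."""
    if response.get("status") != "success":
        raise ValueError("query did not succeed")
    counts: Dict[str, int] = {}
    for series in response["data"]["result"]:
        namespace = series["metric"].get("namespace", "<none>")
        counts[namespace] = counts.get(namespace, 0) + 1
    return counts


# Hypothetical response for the query `up`.
sample_response = {
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {"metric": {"__name__": "up", "namespace": "kube-system"}, "value": [1589000000, "1"]},
            {"metric": {"__name__": "up", "namespace": "kube-system"}, "value": [1589000000, "1"]},
            {"metric": {"__name__": "up", "namespace": "monitoring"}, "value": [1589000000, "1"]},
        ],
    },
}

url = build_query_url(BASE_URL, "up")
print(url)
print(targets_per_namespace(sample_response))
```

If namespaces scraped by both the system and others instances appear in the counts, the data from the two instances is reaching the same store.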


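I wrote this article because every node in my k3s cluster is short on resources and there are many targets to monitor, so Prometheus used up all of a node's memory and kept getting OOM-killed. To make full use of my cloud hosts I had to find another way, which is how this article came about.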
