Over the past few years, Kubernetes has become the de facto standard for container orchestration, and more and more companies run it in production. Prometheus is the usual choice for monitoring a Kubernetes cluster, but because Prometheus itself is a single point of failure, teams end up looking at federation or distributed high-availability solutions; the community projects with the most traction are Thanos, Cortex, and VictoriaMetrics. This article walks through using VictoriaMetrics as the storage backend for monitoring a Kubernetes cluster; deploying Kubernetes itself is not covered here.

Environment and versions

The lab uses a single-node Kubernetes cluster; the network plugin is Cilium, and VictoriaMetrics storage is backed by local PVs (localpv).

[root@cilium-1 victoria-metrics-cluster]# cat /etc/redhat-release
CentOS Linux release 7.9.2009 (Core)
[root@cilium-bgp-1 victoria-metrics-cluster]# uname -r
4.19.110-300.el7.x86_64
[root@cilium-1 pvs]# kubectl get node
NAME                 STATUS   ROLES    AGE   VERSION
cilium-1.novalocal   Ready    master   28m   v1.19.4

Main monitoring targets

  • Load on master and worker nodes
  • Kubernetes component status
  • etcd status
  • Cluster resource status (Deployments, StatefulSets, Pods, ...)
  • User-defined components (reported automatically as targets via the prometheus.io/scrape pod annotation; see the snippet after this list)
  • ...
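For reference, the auto-discovery mentioned in the last item only requires the pod template to carry the scrape annotations that the vmagent configuration (shown later) matches on. A minimal sketch, with an illustrative app name and port:

# illustrative pod-template snippet for a workload exposing /metrics on port 8080
template:
  metadata:
    labels:
      app: my-app                        # hypothetical name
    annotations:
      prometheus.io/scrape: "true"
      prometheus.io/port: "8080"
      prometheus.io/path: "/metrics"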

Components to deploy for monitoring

  • VictoriaMetrics(storage,insert,select,agent,vmalert)
  • promxy
  • kube-state-metrics
  • node-exporter
  • karma
  • alertmanager
  • grafana
  • ...

Deploying VictoriaMetrics

Create local PVs and a StorageClass for the vmstorage component; other network storage can be used instead.

---
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: vm-disks
provisioner: kubernetes.io/no-provisioner
reclaimPolicy: Retain
volumeBindingMode: WaitForFirstConsumer
---
apiVersion: v1
kind: PersistentVolume
metadata:
  name: vm-1
spec:
  capacity:
    storage: 10Gi
  accessModes:
  - ReadWriteOnce
  persistentVolumeReclaimPolicy: Delete
  storageClassName: vm-disks
  local:
    path: /mnt/vmdata-1
  nodeAffinity:
    required:
      nodeSelectorTerms:
      - matchExpressions:
        - key: kubernetes.io/hostname
          operator: In
          values:
          - cilium-1.novalocal
---
...
[root@cilium-1 pvs]# kubectl get sc
NAME       PROVISIONER                    RECLAIMPOLICY   VOLUMEBINDINGMODE      ALLOWVOLUMEEXPANSION   AGE
vm-disks   kubernetes.io/no-provisioner   Retain          WaitForFirstConsumer   false                  6m5s
[root@cilium-1 pvs]# kubectl get pv
NAME   CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS      CLAIM   STORAGECLASS   REASON   AGE
vm-1   10Gi       RWO            Delete           Available           vm-disks                92s
vm-2   10Gi       RWO            Delete           Available           vm-disks                92s
vm-3   10Gi       RWO            Delete           Available           vm-disks                92s
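Since these are local PVs with no provisioner, the host paths referenced by each PersistentVolume have to exist on the node before the vmstorage pods can bind to them. Assuming all three PVs point at the same node, as in the example above:

mkdir -p /mnt/vmdata-1 /mnt/vmdata-2 /mnt/vmdata-3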

Installing with Helm

Add the Helm repo, pull the chart package, and unpack it.

helm repo add vm https://victoriametrics.github.io/helm-charts/
helm repo update
helm fetch vm/victoria-metrics-cluster
tar -xf victoria-metrics-cluster-0.8.25.tgz
# unpacked chart contents:
Chart.yaml  README.md  README.md.gotmpl  templates  values.yaml

Modify values.yaml according to your needs.

Here I mainly configure the storageClass for the vmstorage component (a minimal override example follows the full values.yaml below).

# values.yaml# Default values for victoria-metrics.# This is a YAML-formatted file.# Declare variables to be passed into your templates.# -- k8s cluster domain suffix, uses for building stroage pods' FQDN. Ref: [https://kubernetes.io/docs/tasks/administer-cluster/dns-custom-nameservers/](https://kubernetes.io/docs/tasks/administer-cluster/dns-custom-nameservers/)clusterDomainSuffix: cluster.localprintNotes: truerbac:  create: true  pspEnabled: true  namespaced: false  extraLabels: {}  # annotations: {}serviceAccount:  create: true  # name:  extraLabels: {}  # annotations: {}  # mount API token to pod directly  automountToken: trueextraSecrets:  []  # - name: secret-remote-storage-keys  #   annotations: []  #   labels: []  #   data: |  #     credentials: b64_encoded_strvmselect:  # -- 为vmselect组件创立deployment. 如果有缓存数据的须要,也能够创立为  enabled: true  # -- Vmselect container name  name: vmselect  image:    # -- Image repository    repository: victoriametrics/vmselect    # -- Image tag    tag: v1.59.0-cluster    # -- Image pull policy    pullPolicy: IfNotPresent  # -- Name of Priority Class  priorityClassName: ""  # -- Overrides the full name of vmselect component  fullnameOverride: ""  # -- Suppress rendering `--storageNode` FQDNs based on `vmstorage.replicaCount` value. If true suppress rendering `--stroageNodes`, they can be re-defined in exrtaArgs  suppresStorageFQDNsRender: false  automountServiceAccountToken: true  # Extra command line arguments for vmselect component  extraArgs:    envflag.enable: "true"    envflag.prefix: VM_    loggerFormat: json  annotations: {}  extraLabels: {}  env: []  # Readiness & Liveness probes  probe:    readiness:      initialDelaySeconds: 5      periodSeconds: 15      timeoutSeconds: 5      failureThreshold: 3    liveness:      initialDelaySeconds: 5      periodSeconds: 15      timeoutSeconds: 5      failureThreshold: 3  # Additional hostPath mounts  extraHostPathMounts:   []   # - name: certs-dir   #   mountPath: /etc/kubernetes/certs   #   subPath: ""   #   hostPath: /etc/kubernetes/certs   #   readOnly: true  # Extra Volumes for the pod  extraVolumes:    []    # - name: example    #   configMap:    #     name: example  # Extra Volume Mounts for the container  extraVolumeMounts:    []    # - name: example    #   mountPath: /example  extraContainers:    []    # - name: config-reloader    #   image: reloader-image  initContainers:    []    # - name: example    #   image: example-image  podDisruptionBudget:    # -- See `kubectl explain poddisruptionbudget.spec` for more. Ref: https://kubernetes.io/docs/tasks/run-application/configure-pdb/    enabled: false    # minAvailable: 1    # maxUnavailable: 1    labels: {}  # -- Array of tolerations object. Ref: [https://kubernetes.io/docs/concepts/configuration/assign-pod-node/](https://kubernetes.io/docs/concepts/configuration/assign-pod-node/)  tolerations: []    # - key: "key"    #   operator: "Equal|Exists"    #   value: "value"    #   effect: "NoSchedule|PreferNoSchedule"  # -- Pod's node selector. Ref: [https://kubernetes.io/docs/user-guide/node-selection/](https://kubernetes.io/docs/user-guide/node-selection/)  nodeSelector: {}  # -- Pod affinity  affinity: {}  # -- Pod's annotations  podAnnotations: {}  # -- Count of vmselect pods  replicaCount: 2  # -- Resource object  resources: {}    # limits:    #   cpu: 50m    #   memory: 64Mi    # requests:    #   cpu: 50m    #   memory: 64Mi  # -- Pod's security context. 
Ref: [https://kubernetes.io/docs/tasks/configure-pod-container/security-context/](https://kubernetes.io/docs/tasks/configure-pod-container/security-context/  securityContext: {}  podSecurityContext: {}  # -- Cache root folder  cacheMountPath: /cache  service:    # -- Service annotations    annotations: {}    # -- Service labels    labels: {}    # -- Service ClusterIP    clusterIP: ""    # -- Service External IPs. Ref: [https://kubernetes.io/docs/user-guide/services/#external-ips](https://kubernetes.io/docs/user-guide/services/#external-ips)    externalIPs: []    # -- Service load balacner IP    loadBalancerIP: ""    # -- Load balancer source range    loadBalancerSourceRanges: []    # -- Service port    servicePort: 8481    # -- Service type    type: ClusterIP  ingress:    # -- Enable deployment of ingress for vmselect component    enabled: false    # -- Ingress annotations    annotations: {}    #   kubernetes.io/ingress.class: nginx    #   kubernetes.io/tls-acme: 'true'    extraLabels: {}    # -- Array of host objects    hosts: []    #   - name: vmselect.local    #     path: /select    #    port: http    # -- Array of TLS objects    tls: []    #   - secretName: vmselect-ingress-tls    #     hosts:    #       - vmselect.local  statefulSet:    # -- Deploy StatefulSet instead of Deployment for vmselect. Useful if you want to keep cache data. Creates statefulset instead of deployment, useful when you want to keep the cache    enabled: false    # -- Deploy order policy for StatefulSet pods    podManagementPolicy: OrderedReady    ## Headless service for statefulset    service:      # -- Headless service annotations      annotations: {}      # -- Headless service labels      labels: {}      # -- Headless service port      servicePort: 8481  persistentVolume:    # -- Create/use Persistent Volume Claim for vmselect component. Empty dir if false. If true, vmselect will create/use a Persistent Volume Claim    enabled: false    # -- Array of access mode. Must match those of existing PV or dynamic provisioner. Ref: [http://kubernetes.io/docs/user-guide/persistent-volumes/](http://kubernetes.io/docs/user-guide/persistent-volumes/)    accessModes:      - ReadWriteOnce    # -- Persistent volume annotations    annotations: {}    # -- Existing Claim name. Requires vmselect.persistentVolume.enabled: true. If defined, PVC must be created manually before volume will be bound    existingClaim: ""    ## Vmselect data Persistent Volume mount root path    ##    # -- Size of the volume. Better to set the same as resource limit memory property    size: 2Gi    # -- Mount subpath    subPath: ""  serviceMonitor:    # -- Enable deployment of Service Monitor for vmselect component. This is Prometheus operator object    enabled: false    # -- Target namespace of ServiceMonitor manifest    namespace: ""    # -- Service Monitor labels    extraLabels: {}    # -- Service Monitor annotations    annotations: {}    # Commented. Prometheus scare interval for vmselect component#    interval: 15s    # Commented. Prometheus pre-scrape timeout for vmselect component#    scrapeTimeout: 5svminsert:  # -- Enable deployment of vminsert component. 
Deployment is used  enabled: true  # -- vminsert container name  name: vminsert  image:    # -- Image repository    repository: victoriametrics/vminsert    # -- Image tag    tag: v1.59.0-cluster    # -- Image pull policy    pullPolicy: IfNotPresent  # -- Name of Priority Class  priorityClassName: ""  # -- Overrides the full name of vminsert component  fullnameOverride: ""  # Extra command line arguments for vminsert component  extraArgs:    envflag.enable: "true"    envflag.prefix: VM_    loggerFormat: json  annotations: {}  extraLabels: {}  env: []  # -- Suppress rendering `--storageNode` FQDNs based on `vmstorage.replicaCount` value. If true suppress rendering `--stroageNodes`, they can be re-defined in exrtaArgs  suppresStorageFQDNsRender: false  automountServiceAccountToken: true  # Readiness & Liveness probes  probe:    readiness:      initialDelaySeconds: 5      periodSeconds: 15      timeoutSeconds: 5      failureThreshold: 3    liveness:      initialDelaySeconds: 5      periodSeconds: 15      timeoutSeconds: 5      failureThreshold: 3  initContainers:    []    # - name: example    #   image: example-image  podDisruptionBudget:    # -- See `kubectl explain poddisruptionbudget.spec` for more. Ref: [https://kubernetes.io/docs/tasks/run-application/configure-pdb/](https://kubernetes.io/docs/tasks/run-application/configure-pdb/)    enabled: false    # minAvailable: 1    # maxUnavailable: 1    labels: {}  # -- Array of tolerations object. Ref: [https://kubernetes.io/docs/concepts/configuration/assign-pod-node/](https://kubernetes.io/docs/concepts/configuration/assign-pod-node/)  tolerations: []    # - key: "key"    #   operator: "Equal|Exists"    #   value: "value"    #   effect: "NoSchedule|PreferNoSchedule"  # -- Pod's node selector. Ref: [https://kubernetes.io/docs/user-guide/node-selection/](https://kubernetes.io/docs/user-guide/node-selection/)  nodeSelector: {}  # -- Pod affinity  affinity: {}  # -- Pod's annotations  podAnnotations: {}  # -- Count of vminsert pods  replicaCount: 2  # -- Resource object  resources: {}    # limits:    #   cpu: 50m    #   memory: 64Mi    # requests:    #   cpu: 50m    #   memory: 64Mi  # -- Pod's security context. Ref: [https://kubernetes.io/docs/tasks/configure-pod-container/security-context/](https://kubernetes.io/docs/tasks/configure-pod-container/security-context/)  securityContext: {}  podSecurityContext: {}  service:    # -- Service annotations    annotations: {}    # -- Service labels    labels: {}    # -- Service ClusterIP    clusterIP: ""    # -- Service External IPs. Ref: [https://kubernetes.io/docs/user-guide/services/#external-ips]( https://kubernetes.io/docs/user-guide/services/#external-ips)    externalIPs: []    # -- Service load balancer IP    loadBalancerIP: ""    # -- Load balancer source range    loadBalancerSourceRanges: []    # -- Service port    servicePort: 8480    # -- Service type    type: ClusterIP  ingress:    # -- Enable deployment of ingress for vminsert component    enabled: false    # -- Ingress annotations    annotations: {}    #   kubernetes.io/ingress.class: nginx    #   kubernetes.io/tls-acme: 'true'    extraLabels: {}    # -- Array of host objects    hosts: []    # - name: vminsert.local    #   path: /insert    #   port: http    # -- Array of TLS objects    tls: []    #   - secretName: vminsert-ingress-tls    #     hosts:    #       - vminsert.local  serviceMonitor:    # -- Enable deployment of Service Monitor for vminsert component. 
This is Prometheus operator object    enabled: false    # -- Target namespace of ServiceMonitor manifest    namespace: ""    # -- Service Monitor labels    extraLabels: {}    # -- Service Monitor annotations    annotations: {}    # Commented. Prometheus scare interval for vminsert component#    interval: 15s    # Commented. Prometheus pre-scrape timeout for vminsert component#    scrapeTimeout: 5svmstorage:  # -- Enable deployment of vmstorage component. StatefulSet is used  enabled: true  # -- vmstorage container name  name: vmstorage  image:    # -- Image repository    repository: victoriametrics/vmstorage    # -- Image tag    tag: v1.59.0-cluster    # -- Image pull policy    pullPolicy: IfNotPresent  # -- Name of Priority Class  priorityClassName: ""  # -- Overrides the full name of vmstorage component  fullnameOverride:  automountServiceAccountToken: true  env: []  # -- Data retention period. Supported values 1w, 1d, number without measurement means month, e.g. 2 = 2month  retentionPeriod: 1  # Additional vmstorage container arguments. Extra command line arguments for vmstorage component  extraArgs:    envflag.enable: "true"    envflag.prefix: VM_    loggerFormat: json  # Additional hostPath mounts  extraHostPathMounts:    []    # - name: certs-dir    #   mountPath: /etc/kubernetes/certs    #   subPath: ""    #   hostPath: /etc/kubernetes/certs    #   readOnly: true  # Extra Volumes for the pod  extraVolumes:    []    # - name: example    #   configMap:    #     name: example  # Extra Volume Mounts for the container  extraVolumeMounts:    []    # - name: example    #   mountPath: /example  extraContainers:    []    # - name: config-reloader    #   image: reloader-image  initContainers:    []    # - name: vmrestore    #   image: victoriametrics/vmrestore:latest    #   volumeMounts:    #   - mountPath: /storage    #     name: vmstorage-volume    #   - mountPath: /etc/vm/creds    #     name: secret-remote-storage-keys    #     readOnly: true    #   args:    #   - -storageDataPath=/storage    #   - -src=s3://your_bucket/folder/latest    #   - -credsFilePath=/etc/vm/creds/credentials  # -- See `kubectl explain poddisruptionbudget.spec` for more. Ref: [https://kubernetes.io/docs/tasks/run-application/configure-pdb/](https://kubernetes.io/docs/tasks/run-application/configure-pdb/)  podDisruptionBudget:    enabled: false    # minAvailable: 1    # maxUnavailable: 1    labels: {}  # -- Array of tolerations object. Node tolerations for server scheduling to nodes with taints. Ref: [https://kubernetes.io/docs/concepts/configuration/assign-pod-node/](https://kubernetes.io/docs/concepts/configuration/assign-pod-node/)  ##  tolerations:    []    # - key: "key"    #   operator: "Equal|Exists"    #   value: "value"    #   effect: "NoSchedule|PreferNoSchedule"  # -- Pod's node selector. Ref: [https://kubernetes.io/docs/user-guide/node-selection/](https://kubernetes.io/docs/user-guide/node-selection/)  nodeSelector: {}  # -- Pod affinity  affinity: {}  ## Use an alternate scheduler, e.g. "stork".  ## ref: https://kubernetes.io/docs/tasks/administer-cluster/configure-multiple-schedulers/  ##  # schedulerName:  persistentVolume:    # -- Create/use Persistent Volume Claim for vmstorage component. Empty dir if false. If true,  vmstorage will create/use a Persistent Volume Claim    enabled: true    # -- Array of access modes. Must match those of existing PV or dynamic provisioner. 
Ref: [http://kubernetes.io/docs/user-guide/persistent-volumes/](http://kubernetes.io/docs/user-guide/persistent-volumes/)    accessModes:      - ReadWriteOnce    # -- Persistent volume annotations    annotations: {}    # -- Storage class name. Will be empty if not setted    storageClass: "vm-disks" 为vm-storage指定storageclass    # --  Existing Claim name. Requires vmstorage.persistentVolume.enabled: true. If defined, PVC must be created manually before volume will be bound    existingClaim: ""    # -- Data root path. Vmstorage data Persistent Volume mount root path    mountPath: /storage    # -- Size of the volume. Better to set the same as resource limit memory property    size: 8Gi    # -- Mount subpath    subPath: ""  # -- Pod's annotations  podAnnotations: {}  annotations: {}  extraLabels: {}  # -- Count of vmstorage pods  replicaCount: 3  # -- Deploy order policy for StatefulSet pods  podManagementPolicy: OrderedReady  # -- Resource object. Ref: [http://kubernetes.io/docs/user-guide/compute-resources/](http://kubernetes.io/docs/user-guide/compute-resources/)  resources: {}    # limits:    #   cpu: 500m    #   memory: 512Mi    # requests:    #   cpu: 500m    #   memory: 512Mi  # -- Pod's security context. Ref: [https://kubernetes.io/docs/tasks/configure-pod-container/security-context/](https://kubernetes.io/docs/tasks/configure-pod-container/security-context/)  securityContext: {}  podSecurityContext: {}  service:    # -- Service annotations    annotations: {}    # -- Service labels    labels: {}    # -- Service port    servicePort: 8482    # -- Port for accepting connections from vminsert    vminsertPort: 8400    # -- Port for accepting connections from vmselect    vmselectPort: 8401  # -- Pod's termination grace period in seconds  terminationGracePeriodSeconds: 60  probe:    readiness:      initialDelaySeconds: 5      periodSeconds: 15      timeoutSeconds: 5      failureThreshold: 3    liveness:      initialDelaySeconds: 5      periodSeconds: 15      timeoutSeconds: 5      failureThreshold: 3  serviceMonitor:    # -- Enable deployment of Service Monitor for vmstorage component. This is Prometheus operator object    enabled: false    # -- Target namespace of ServiceMonitor manifest    namespace: ""    # -- Service Monitor labels    extraLabels: {}    # -- Service Monitor annotations    annotations: {}    # Commented. Prometheus scare interval for vmstorage component#    interval: 15s    # Commented. Prometheus pre-scrape timeout for vmstorage component#    scrapeTimeout: 5s
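Instead of editing the full values.yaml in place, the same change can be expressed as a small override file and passed to Helm. A sketch under the assumption that only the vmstorage persistence settings need to differ from the chart defaults (the file name is illustrative):

# values-override.yaml
vmstorage:
  replicaCount: 3
  persistentVolume:
    enabled: true
    storageClass: "vm-disks"
    size: 8Gi

It would then be installed with helm install vm -n vm ./ -f values-override.yaml.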

Deploy

kubectl create ns vm
helm install vm -n vm ./

To print the rendered YAML, add the --debug --dry-run flags.

Check the created resources.

[root@cilium-1 ~]# kubectl get po -n vm
NAME                                                    READY   STATUS    RESTARTS   AGE
vm-victoria-metrics-cluster-vminsert-559db87988-cnb7g   1/1     Running   0          5m
vm-victoria-metrics-cluster-vminsert-559db87988-jm4cj   1/1     Running   0          5m
vm-victoria-metrics-cluster-vmselect-b77474bcf-6rrcz    1/1     Running   0          5m
vm-victoria-metrics-cluster-vmselect-b77474bcf-dsl4j    1/1     Running   0          5m
vm-victoria-metrics-cluster-vmstorage-0                 1/1     Running   0          5m
vm-victoria-metrics-cluster-vmstorage-1                 1/1     Running   0          5m
vm-victoria-metrics-cluster-vmstorage-2                 1/1     Running   0          5m
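It is also worth confirming that each vmstorage replica bound one of the local PVs created earlier; for example:

kubectl -n vm get pvc
# each vmstorage pod's claim should show STATUS Bound against one of the vm-disks PVs (vm-1/vm-2/vm-3)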

Deploying kube-state-metrics

kube-state-metrics is a simple service that listens to the Kubernetes API server and generates metrics about the state of objects.

Choose a kube-state-metrics version that matches the cluster's Kubernetes version:

kube-state-metrics   Kubernetes 1.16   Kubernetes 1.17   Kubernetes 1.18   Kubernetes 1.19   Kubernetes 1.20
v1.8.0               -                 -                 -                 -                 -
v1.9.8               ✓                 -                 -                 -                 -
v2.0.0               -                 -/✓               -/✓               ✓                 ✓
master               -                 -/✓               -/✓               ✓                 ✓

  • ✓ Fully supported version range.
  • - The Kubernetes cluster has features the client-go library can't use (additional API objects, deprecated APIs, etc).

This article uses v2.0.0.

git clone https://github.com/kubernetes/kube-state-metrics.git -b release-2.0
cd kube-state-metrics/examples/autosharding
# main files under kube-state-metrics/examples/autosharding
[root@cilium-bgp-1 autosharding]# ls
cluster-role-binding.yaml  cluster-role.yaml  role-binding.yaml  role.yaml  service-account.yaml  service.yaml  statefulset.yaml
kubectl apply -f ./

Due to network restrictions, the kube-state-metrics image may fail to pull; in that case, modify statefulset.yaml to use bitnami/kube-state-metrics:2.0.0 instead.
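One quick way to make that swap in place — a sketch, since the exact upstream image string may differ between chart revisions, so check statefulset.yaml first:

sed -i 's#image: .*kube-state-metrics.*#image: bitnami/kube-state-metrics:2.0.0#' statefulset.yaml
kubectl apply -f statefulset.yaml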

Deploying node_exporter

node-exporter collects host-level runtime metrics such as loadavg, filesystem, and meminfo, similar to what a zabbix-agent provides in traditional host monitoring.

It is deployed to the cluster as a DaemonSet, and the scrape annotations let vmagent add the targets automatically.

---
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: node-exporter
  namespace: node-exporter
  labels:
    k8s-app: node-exporter
spec:
  selector:
    matchLabels:
      k8s-app: node-exporter
  template:
    metadata:
      labels:
        k8s-app: node-exporter
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "9100"
        prometheus.io/path: "/metrics"
    spec:
      containers:
      - name: node-exporter
        image: quay.io/prometheus/node-exporter:v1.1.2
        ports:
        - name: metrics
          containerPort: 9100
        args:
        - "--path.procfs=/host/proc"
        - "--path.sysfs=/host/sys"
        - "--path.rootfs=/host"
        volumeMounts:
        - name: dev
          mountPath: /host/dev
        - name: proc
          mountPath: /host/proc
        - name: sys
          mountPath: /host/sys
        - name: rootfs
          mountPath: /host
      volumes:
        - name: dev
          hostPath:
            path: /dev
        - name: proc
          hostPath:
            path: /proc
        - name: sys
          hostPath:
            path: /sys
        - name: rootfs
          hostPath:
            path: /
      hostPID: true
      hostNetwork: true
      tolerations:
      - operator: "Exists"
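Because the DaemonSet runs with hostNetwork: true, node-exporter listens on each node's own IP on port 9100, which makes a quick smoke test straightforward (replace <node-ip> with a real node address):

kubectl -n node-exporter get pods -o wide
curl -s http://<node-ip>:9100/metrics | head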

Deploying vmagent

Connecting to etcd requires client certificates; since our cluster was deployed with kubeadm, first create an etcd secret on the master node.

kubectl -n vm create secret generic etcd-certs \
  --from-file=/etc/kubernetes/pki/etcd/healthcheck-client.crt \
  --from-file=/etc/kubernetes/pki/etcd/healthcheck-client.key \
  --from-file=/etc/kubernetes/pki/etcd/ca.crt

For an independently deployed etcd cluster, save its certificates into a secret object in the cluster in the same way.

To monitor kube-controller-manager and kube-scheduler, edit kube-controller-manager.yaml and kube-scheduler.yaml under /etc/kubernetes/manifests and change - --bind-address=127.0.0.1 to 0.0.0.0.
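On a kubeadm cluster this can be done with a one-liner on each master node; a sketch assuming the default static-manifest paths (kubelet restarts the static pods automatically after the files change):

sed -i 's/--bind-address=127.0.0.1/--bind-address=0.0.0.0/' \
  /etc/kubernetes/manifests/kube-controller-manager.yaml \
  /etc/kubernetes/manifests/kube-scheduler.yaml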

#vmagent.yaml#创立vmagent service 用于拜访vmagent接口能够查看监控/targets---apiVersion: v1kind: Servicemetadata:  labels:    app: vmagent-k8s  name: vmagent-k8s  namespace: vmspec:  ports:    - port: 8429      protocol: TCP      targetPort: http      name: http  selector:    app: vmagent-k8s  type: NodePort---apiVersion: apps/v1kind: StatefulSetmetadata:  name: vmagent-k8s  namespace: vm  labels:    app: vmagent-k8sspec:  serviceName: "vmagent-k8s"  replicas: 1  selector:    matchLabels:      app: vmagent-k8s  template:    metadata:      labels:        app: vmagent-k8s    spec:      serviceAccountName: vmagent-k8s      containers:      - name: vmagent-k8s        image: victoriametrics/vmagent:v1.59.0        env:        - name: POD_NAME          valueFrom:            fieldRef:              fieldPath: metadata.name        args:          - -promscrape.config=/etc/prometheus/prometheus.yaml          - -remoteWrite.tmpDataPath=/vmtmp          - -remoteWrite.url=http://vmetric-victoria-metrics-cluster-vminsert.vm.svc.cluster.local:8480/insert/0/prometheus/ #配置vmagent remotewrite接口        ports:          - name: http            containerPort: 8429        volumeMounts:          - name: time            mountPath: /etc/localtime            readOnly: true          - name: config            mountPath: /etc/prometheus/          - mountPath: "/etc/kubernetes/pki/etcd/" 挂载etcd secret 用于连贯etcd接口            name: etcd-certs          - mountPath: "/vmtmp"            name: tmp      volumes:        - name: "tmp"          emptyDir: {}        - name: time          hostPath:            path: /etc/localtime        - name: config          configMap:            name: vmagent-k8s        - name: etcd-certs          secret:            secretName: etcd-certs     updateStrategy:    type: RollingUpdate---apiVersion: v1kind: ServiceAccountmetadata:  name: vmagent-k8s  namespace: vm---apiVersion: rbac.authorization.k8s.io/v1kind: ClusterRolemetadata:  name: vmagent-k8srules:- apiGroups: [""]  resources:  - nodes  - nodes/proxy  - services  - endpoints  - pods  verbs: ["get", "list", "watch"]- apiGroups: [""]  resources:  - configmaps  verbs: ["get"]- nonResourceURLs: ["/metrics"]  verbs: ["get"]---apiVersion: rbac.authorization.k8s.io/v1kind: ClusterRoleBindingmetadata:  name: vmagent-k8sroleRef:  apiGroup: rbac.authorization.k8s.io  kind: ClusterRole  name: vmagent-k8ssubjects:- kind: ServiceAccount  name: vmagent-k8s  namespace: vm---#应用configmap为vmagent挂载配置文件apiVersion: v1kind: ConfigMapmetadata:  name: vmagent-k8s  namespace: vmdata:  prometheus.yaml: |-    global:      scrape_interval: 60s      scrape_timeout: 60s      external_labels:        cluster: test  # 依据需要增加自定义标签        datacenter: test    scrape_configs:    - job_name: etcd      scheme: https      tls_config:        insecure_skip_verify: true        ca_file: /etc/kubernetes/pki/etcd/ca.crt        cert_file: /etc/kubernetes/pki/etcd/healthcheck-client.crt        key_file: /etc/kubernetes/pki/etcd/healthcheck-client.key      static_configs:      - targets:        - 192.168.0.1:2379 #依据集群理论状况增加etcd endpoints    - job_name: kube-scheduler      scheme: http      static_configs:      - targets:        - 192.168.0.1:10259 #依据集群理论状况增加scheduler endpoints      tls_config:        ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt        insecure_skip_verify: true      bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token    - job_name: kube-controller-manager      scheme: http      static_configs:      - targets:        - 192.168.0.1:10257 
#依据集群理论状况增加controller-manager endpoints      tls_config:        ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt        insecure_skip_verify: true      bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token    - job_name: kube-apiserver #配置apiserver连贯      kubernetes_sd_configs:      - role: endpoints      relabel_configs:      - action: keep        regex: default;kubernetes;https        source_labels:        - __meta_kubernetes_namespace        - __meta_kubernetes_service_name        - __meta_kubernetes_endpoint_port_name      scheme: https      tls_config:        ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt        insecure_skip_verify: true      bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token    - job_name: kubernetes-nodes # 配置kubelet代理      kubernetes_sd_configs:      - role: node      relabel_configs:      - action: labelmap        regex: __meta_kubernetes_node_label_(.+)      - target_label: __address__        replacement: kubernetes.default.svc:443      - source_labels: [__meta_kubernetes_node_name]        regex: (.+)        target_label: __metrics_path__        replacement: /api/v1/nodes/${1}/proxy/metrics      - action: labelmap        regex: __meta_kubernetes_pod_label_(.+)      scheme: https      tls_config:        ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt        insecure_skip_verify: true      bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token    - job_name: kubernetes-cadvisor #配置cadvisor代理 获取容器负载数据      scrape_interval: 15s      scrape_timeout: 15s      kubernetes_sd_configs:      - role: node      relabel_configs:      - target_label: __address__        replacement: kubernetes.default.svc:443      - source_labels: [__meta_kubernetes_node_name]        regex: (.+)        target_label: __metrics_path__        replacement: /api/v1/nodes/${1}/proxy/metrics/cadvisor      - action: labelmap        regex: __meta_kubernetes_pod_label_(.+)      scheme: https      tls_config:        ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt        insecure_skip_verify: true      bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token    - job_name: kubernetes-service-endpoints # 配置获取服务后端endpoints 须要采集的targets      kubernetes_sd_configs:      - role: endpoints      relabel_configs:      - action: keep        regex: true        source_labels:        - __meta_kubernetes_service_annotation_prometheus_io_scrape      - action: replace        regex: (https?)        
source_labels:        - __meta_kubernetes_service_annotation_prometheus_io_scheme        target_label: __scheme__      - action: replace        regex: (.+)        source_labels:        - __meta_kubernetes_service_annotation_prometheus_io_path        target_label: __metrics_path__      - action: replace        regex: ([^:]+)(?::\d+)?;(\d+)        replacement: $1:$2        source_labels:        - __address__        - __meta_kubernetes_service_annotation_prometheus_io_port        target_label: __address__      - action: labelmap        regex: __meta_kubernetes_service_label_(.+)      - action: replace        source_labels:        - __meta_kubernetes_namespace        target_label: kubernetes_namespace      - action: replace        source_labels:        - __meta_kubernetes_service_name        target_label: kubernetes_name            - job_name: kubernetes-pods # 配置获取容器内服务自定义指标targets      kubernetes_sd_configs:      - role: pod      relabel_configs:      - action: keep        regex: true        source_labels:        - __meta_kubernetes_pod_annotation_prometheus_io_scrape      - action: replace        regex: (.+)        source_labels:        - __meta_kubernetes_pod_annotation_prometheus_io_path        target_label: __metrics_path__      - action: replace        regex: ([^:]+)(?::\d+)?;(\d+)        replacement: $1:$2        source_labels:        - __address__        - __meta_kubernetes_pod_annotation_prometheus_io_port        target_label: __address__      - action: labelmap        regex: __meta_kubernetes_pod_label_(.+)      - action: replace        source_labels:        - __meta_kubernetes_namespace        target_label: kubernetes_namespace      - action: replace        source_labels:        - __meta_kubernetes_pod_name        target_label: kubernetes_pod_name
kubectl apply -f vmagent.yaml

You can check the scrape targets by opening the vmagent NodePort /targets endpoint in a browser.
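For example, assuming the NodePort assigned to the vmagent-k8s service:

kubectl -n vm get svc vmagent-k8s              # note the NodePort mapped to 8429
curl -s http://<node-ip>:<nodeport>/targets    # lists the discovered scrape targets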

Deploying promxy

VictoriaMetrics has no query UI and does not provide remote_read, so the third-party tool Promxy can be used to get a Prometheus-like interface for running PromQL queries against the data.

Promxy is an aggregating proxy that can also be used to build Prometheus HA; see its GitHub documentation for details. It is a tool worth trying, and VictoriaMetrics itself recommends it to cover some of its own gaps.

[root@cilium-1 promxy]# cat promxy.yaml apiVersion: v1data:  config.yaml: |    ### Promxy configuration 仅须要配置victoriametrics select组件地址及接口     promxy:      server_groups:        - static_configs:            - targets:              - vm-victoria-metrics-cluster-vmselect.vm.svc.cluster.local:8481          path_prefix: /select/0/prometheuskind: ConfigMapmetadata:  name: promxy-config  namespace: vm---apiVersion: v1kind: Servicemetadata:  labels:    app: promxy  name: promxy  namespace: vmspec:  ports:  - name: promxy    port: 8082    protocol: TCP    targetPort: 8082  type: NodePort  selector:    app: promxy---apiVersion: apps/v1kind: Deploymentmetadata:  labels:    app: promxy  name: promxy  namespace: vmspec:  replicas: 1  selector:    matchLabels:      app: promxy  template:    metadata:      labels:        app: promxy    spec:      containers:      - args:        - "--config=/etc/promxy/config.yaml"        - "--web.enable-lifecycle"        command:        - "/bin/promxy"        image: quay.io/jacksontj/promxy:latest        imagePullPolicy: Always        livenessProbe:          httpGet:            path: "/-/healthy"            port: 8082          initialDelaySeconds: 3        name: promxy        ports:        - containerPort: 8082        readinessProbe:          httpGet:            path: "/-/ready"            port: 8082          initialDelaySeconds: 3        volumeMounts:        - mountPath: "/etc/promxy/"          name: promxy-config          readOnly: true      # container to reload configs on configmap change      - args:        - "--volume-dir=/etc/promxy"        - "--webhook-url=http://localhost:8082/-/reload"        image: jimmidyson/configmap-reload:v0.1        name: promxy-server-configmap-reload        volumeMounts:        - mountPath: "/etc/promxy/"          name: promxy-config          readOnly: true      volumes:      - configMap:          name: promxy-config        name: promxy-config
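Promxy exposes the familiar Prometheus web UI and HTTP API, so once the Deployment above is running it can be opened through its NodePort (or added to Grafana as a Prometheus datasource). A quick check, assuming the standard Prometheus UI paths:

kubectl -n vm get svc promxy
# open http://<node-ip>:<nodeport>/graph and run PromQL such as up{job="etcd"}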

Deploying PrometheusAlert

PrometheusAlert is an open-source ops alerting message-forwarding system. It accepts alert messages from mainstream monitoring systems such as Prometheus and Zabbix, the log system Graylog, and the data visualization system Grafana, and can forward them to DingTalk, WeChat, Huawei Cloud SMS, Tencent Cloud SMS, Tencent Cloud voice calls, Alibaba Cloud SMS, Alibaba Cloud voice calls, and more.

In this experiment, PrometheusAlert acts as the webhook that receives alerts forwarded by Alertmanager.

The configuration file below is trimmed down to the DingTalk example only; see the official documentation for the full configuration.

apiVersion: v1data:  app.conf: |    #---------------------↓全局配置-----------------------    appname = PrometheusAlert    #登录用户名    login_user=prometheusalert    #登录明码    login_password=prometheusalert    #监听地址    httpaddr = "0.0.0.0"    #监听端口    httpport = 8080    runmode = dev    #设置代理    #proxy =    #开启JSON申请    copyrequestbody = true    #告警音讯题目    title=PrometheusAlert    #链接到告警平台地址    #GraylogAlerturl=http://graylog.org    #钉钉告警 告警logo图标地址    logourl=https://raw.githubusercontent.com/feiyu563/PrometheusAlert/master/doc/alert-center.png    #钉钉告警 复原logo图标地址    rlogourl=https://raw.githubusercontent.com/feiyu563/PrometheusAlert/master/doc/alert-center.png    #故障复原是否启用电话告诉0为敞开,1为开启    phonecallresolved=0    #主动告警克制(主动告警克制是默认同一个告警源的告警信息只发送告警级别最高的第一条告警信息,其余音讯默认屏蔽,这么做的目标是为了缩小雷同告警起源的音讯数量,避免告警炸弹,0为敞开,1为开启)    silent=0    #是否前台输入file or console    logtype=file    #日志文件门路    logpath=logs/prometheusalertcenter.log    #转换Prometheus,graylog告警音讯的时区为CST时区(如默认曾经是CST时区,请勿开启)    prometheus_cst_time=0    #数据库驱动,反对sqlite3,mysql,postgres如应用mysql或postgres,请开启db_host,db_port,db_user,db_password,db_name的正文    db_driver=sqlite3    #---------------------↓webhook-----------------------    #是否开启钉钉告警通道,可同时开始多个通道0为敞开,1为开启    open-dingding=1    #默认钉钉机器人地址    ddurl=https://oapi.dingtalk.com/robot/send?access_token=xxxxx    #是否开启 @所有人(0为敞开,1为开启)    dd_isatall=1kind: ConfigMapmetadata:  name: prometheus-alert-center-conf  namespace: vm---apiVersion: apps/v1kind: Deploymentmetadata:  labels:    app: prometheus-alert-center    alertname: prometheus-alert-center  name: prometheus-alert-center  namespace: vm  spec:  replicas: 1  selector:    matchLabels:      app: prometheus-alert-center      alertname: prometheus-alert-center  template:    metadata:      labels:        app: prometheus-alert-center        alertname: prometheus-alert-center    spec:      containers:      - image: feiyu563/prometheus-alert        name: prometheus-alert-center        env:        - name: TZ          value: "Asia/Shanghai"        ports:        - containerPort: 8080          name: http        resources:          limits:            cpu: 200m            memory: 200Mi          requests:            cpu: 100m            memory: 100Mi        volumeMounts:        - name: prometheus-alert-center-conf-map          mountPath: /app/conf/app.conf          subPath: app.conf        - name: prometheus-alert-center-conf-map          mountPath: /app/user.csv          subPath: user.csv      volumes:      - name: prometheus-alert-center-conf-map        configMap:          name: prometheus-alert-center-conf          items:          - key: app.conf            path: app.conf          - key: user.csv            path: user.csv---apiVersion: v1kind: Servicemetadata:  labels:    alertname: prometheus-alert-center  name: prometheus-alert-center  namespace: vm    annotations:    prometheus.io/scrape: 'true'    prometheus.io/port: '8080'  spec:  ports:  - name: http    port: 8080    targetPort: http  type: NodePort  selector:    app: prometheus-alert-center
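Once the Deployment is up, the PrometheusAlert web UI can be reached through its NodePort or a port-forward; the credentials come from login_user/login_password in the ConfigMap above. A quick check:

kubectl -n vm get svc prometheus-alert-center
kubectl -n vm port-forward svc/prometheus-alert-center 8080:8080
# then open http://localhost:8080 and log in with prometheusalert / prometheusalert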

Deploying Alertmanager and Karma

Alerting has always been an important part of a monitoring system. In Prometheus, collection and alerting are decoupled: alerting rules are defined in Prometheus, and only when a rule fires is the alert forwarded to the separate Alertmanager component, which processes it and finally delivers it to the configured receivers.

Alertmanager can be configured to form a highly available cluster via the --cluster-* flags. It is important not to load-balance traffic between Prometheus and its Alertmanagers, but to point Prometheus at the list of all Alertmanagers; see the official documentation for details.

Alertmanager uses a gossip protocol to propagate information between instances, ensuring that even when several Alertmanagers each receive the same alert, only one notification is sent to the receiver.

For the Alertmanager nodes to communicate with each other, the corresponding flags must be set at startup. The main ones are:

--cluster.listen-address string: the cluster listen address of the current instance

--cluster.peer value: cluster addresses of the other instances to join at startup

In this experiment, Alertmanager is deployed as a Kubernetes StatefulSet, a headless service provides DNS-based addressing between the replicas (sketched after the Service manifest below), and PrometheusAlert is configured as the alert-forwarding backend.

Karma is a rather polished alert dashboard with various filtering capabilities, which makes browsing alerts much more intuitive.

Deploying Alertmanager on Kubernetes

The test environment has no persistent storage configured; for a real deployment, use persistent storage whenever possible.

# config-map.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: alertmanager-config
  namespace: vm
data:
  alertmanager.yaml: |-
    global:
      resolve_timeout: 5m
    route:
      group_by: ['alertname']
      group_wait: 5m
      group_interval: 10s
      repeat_interval: 10m
      receiver: 'web.hook.prometheusalert'
    receivers:
    - name: web.hook.prometheusalert
      webhook_configs:
      - url: http://prometheus-alert-center:8080/prometheus/alert
# alertmanager-statefulset.yamlapiVersion: apps/v1kind: StatefulSetmetadata:  name: alertmanager  namespace: vmspec:  serviceName: alertmanager  podManagementPolicy: Parallel  replicas: 3  selector:    matchLabels:      app: alertmanager  template:    metadata:      labels:        app: alertmanager    spec:      securityContext:        fsGroup: 2000        runAsNonRoot: true        runAsUser: 1000      containers:      - name: alertmanager        image: quay.io/prometheus/alertmanager:v0.21.0        args:        - --config.file=/etc/alertmanager/config/alertmanager.yaml        - --cluster.listen-address=[$(POD_IP)]:9094        - --storage.path=/alertmanager        - --data.retention=120h        - --web.listen-address=:9093        - --web.route-prefix=/        - --cluster.peer=alertmanager-0.alertmanager.$(POD_NAMESPACE).svc:9094        - --cluster.peer=alertmanager-1.alertmanager.$(POD_NAMESPACE).svc:9094        - --cluster.peer=alertmanager-2.alertmanager.$(POD_NAMESPACE).svc:9094        env:        - name: POD_IP          valueFrom:            fieldRef:              apiVersion: v1              fieldPath: status.podIP        - name: POD_NAMESPACE          valueFrom:            fieldRef:              apiVersion: v1              fieldPath: metadata.namespace        ports:        - containerPort: 9093          name: web          protocol: TCP        - containerPort: 9094          name: mesh-tcp          protocol: TCP        - containerPort: 9094          name: mesh-udp          protocol: UDP        livenessProbe:          failureThreshold: 10          httpGet:            path: /-/healthy            port: web            scheme: HTTP          periodSeconds: 10          successThreshold: 1          timeoutSeconds: 3        readinessProbe:          failureThreshold: 10          httpGet:            path: /-/ready            port: web            scheme: HTTP          periodSeconds: 5          successThreshold: 1          timeoutSeconds: 3        volumeMounts:        - mountPath: etc/alertmanager/config          name: alertmanager-config        - mountPath: alertmanager          name: alertmanager-storage      volumes:      - name: alertmanager-config        configMap:          defaultMode: 420          name: alertmanager-config      - name: alertmanager-storage        emptyDir: {}      
# alertmanager-svc.yaml
apiVersion: v1
kind: Service
metadata:
  name: alertmanager
  namespace: vm
  labels:
    app: alertmanager
spec:
  type: NodePort
  selector:
    app: alertmanager
  ports:
  - name: web
    protocol: TCP
    port: 9093
    targetPort: web
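The --cluster.peer addresses in the StatefulSet (alertmanager-0.alertmanager.<namespace>.svc and so on) rely on per-pod DNS records, which Kubernetes only creates for a headless service matching the StatefulSet's serviceName. The NodePort service above does not provide that, so a headless service is needed alongside it. A minimal sketch — the name alertmanager-headless and the corresponding serviceName/--cluster.peer adjustments are assumptions, not part of the original manifests:

# alertmanager-headless-svc.yaml (illustrative; point serviceName and the --cluster.peer
# hostnames in the StatefulSet at "alertmanager-headless" if you use this name)
apiVersion: v1
kind: Service
metadata:
  name: alertmanager-headless
  namespace: vm
  labels:
    app: alertmanager
spec:
  clusterIP: None          # headless: creates DNS records like alertmanager-0.alertmanager-headless.vm.svc
  selector:
    app: alertmanager
  ports:
  - name: web
    port: 9093
    targetPort: web
  - name: mesh-tcp
    port: 9094
    targetPort: 9094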

Deploying Karma on Kubernetes

apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: karma
  name: karma
  namespace: vm
spec:
  replicas: 1
  selector:
    matchLabels:
      app: karma
  template:
    metadata:
      labels:
        app: karma
    spec:
      containers:
        - image: ghcr.io/prymitive/karma:v0.85
          name: karma
          ports:
            - containerPort: 8080
              name: http
          resources:
            limits:
              cpu: 400m
              memory: 400Mi
            requests:
              cpu: 200m
              memory: 200Mi
          env:
          - name: ALERTMANAGER_URI
            value: "http://alertmanager.vm.svc.cluster.local:9093"
---
apiVersion: v1
kind: Service
metadata:
  labels:
    app: karma
  name: karma
  namespace: vm
spec:
  ports:
    - name: http
      port: 8080
      targetPort: http
  selector:
    app: karma
  type: NodePort

Deploying Grafana

The test environment has no persistent storage configured; for a real deployment, use persistent storage whenever possible, otherwise a container restart will lose the imported dashboards.

apiVersion: apps/v1kind: StatefulSetmetadata:  name: grafana  namespace: vm  labels:    app: grafanaspec:  replicas: 1  selector:    matchLabels:      app: grafana  serviceName: grafana  template:    metadata:      annotations:        prometheus.io/scrape: 'true'        prometheus.io/port: "3000"      labels:        app: grafana    spec:      containers:      - name: grafana        image: grafana:7.5.4        ports:        - name: http          containerPort: 3000        resources:          limits:            cpu: 1000m            memory: 1000Mi          requests:            cpu: 600m            memory: 600Mi        volumeMounts:          - name: grafana-storage            mountPath: /var/lib/grafana          - mountPath: /etc/grafana/provisioning/datasources            readOnly: false            name: grafana-datasources        env:        - name: GF_AUTH_ANONYMOUS_ENABLED          value: "true"        - name: GF_AUTH_ANONYMOUS_ORG_ROLE          value: "Admin"      securityContext:        runAsNonRoot: true        runAsUser: 65534      serviceAccountName: grafana      volumes:        - name: grafana-datasources          configMap:            name: grafana-datasources  updateStrategy:    type: RollingUpdate---apiVersion: v1kind: ServiceAccountmetadata:  name: grafana  namespace: vm---apiVersion: v1kind: Servicemetadata:  labels:    app: grafana  name: grafana  namespace: monitorspec:  ports:    - name: http      port: 3000      targetPort: http  selector:    statefulset.kubernetes.io/pod-name: grafana-0  type: NodePort---apiVersion: v1data:  prometheus.yaml: |-    {        "apiVersion": 1,        "datasources": [            {                 "access": "proxy",                 "editable": false,                 "name": "Prometheus",                 "orgId": 1,                 "type": "prometheus",                 "url": "http://vm-victoria-metrics-cluster-vmselect.vm.svc.cluster.local:8481:/select/0/prometheus",                 "version": 1            },        ]    }kind: ConfigMapmetadata:  name: grafana-datasources  namespace: vm

Recommended dashboards

Import them yourself and adjust them to your actual needs.

  • Kubernetes for Prometheus Dashboard

  • Node Exporter for Prometheus Dashboard
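With anonymous Admin access enabled by the GF_AUTH_ANONYMOUS_* environment variables in the StatefulSet above, the dashboards can be imported straight from the Grafana UI reached through its NodePort; for example:

kubectl get svc -A | grep grafana     # find the NodePort (check which namespace the Service landed in)
# open http://<node-ip>:<nodeport>/ and import the dashboards by their grafana.com IDs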

Deploying vmalert

Key arguments

  • -datasource.url: the query endpoint used to evaluate rules
  • -notifier.url: the Alertmanager address
# rule.yaml 示例rule规定配置文件apiVersion: v1data:  k8s.yaml: |-    groups:    - name: k8s      rules:      - alert: KubernetesNodeReady        expr: kube_node_status_condition{condition="Ready",status="true"} == 0        for: 5m        labels:          alert_level: high          alert_type: state          alert_source_type: k8s        annotations:          summary: "Kubernetes Node ready (instance {{ $labels.instance }})"          description: "Node {{ $labels.node }} has been unready for a long time\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"      - alert: KubernetesMemoryPressure        expr: kube_node_status_condition{condition="MemoryPressure",status="true"} == 1        for: 5m        labels:          alert_level: middle          alert_type: mem          alert_source_type: k8s        annotations:          summary: "Kubernetes memory pressure (instance {{ $labels.instance }})"          description: "{{ $labels.node }} has MemoryPressure condition\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"      - alert: KubernetesPodCrashLooping        expr: rate(kube_pod_container_status_restarts_total[15m]) * 60 * 5 > 5        for: 5m        labels:          alert_level: middle          alert_type: state          alert_source_type: k8s        annotations:          summary: "Kubernetes pod crash looping (instance {{ $labels.instance }})"          description: "Pod {{ $labels.pod }} is crash looping\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"  pod.yaml: |-    groups:    - name: pod      rules:      - alert: ContainerMemoryUsage        expr: (sum(container_memory_working_set_bytes) BY (instance, name) / sum(container_spec_memory_limit_bytes > 0) BY (instance, name) * 100) > 80        for: 5m        labels:          alert_level: middle          alert_type: mem          alert_source_type: pod        annotations:          summary: "Container Memory usage (instance {{ $labels.instance }})"          description: "Container Memory usage is above 80%\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"  kvm_node_exporter.yaml: |-    groups:    - name: kvm      rules:      - alert: VirtualMachineDown        expr: up{machinetype="virtualmachine"} == 0        for: 2m        labels:          alert_level: high          alert_type: state          alert_source_type: kvm        annotations:          summary: "Prometheus VirtualmachineMachine target missing (instance {{ $labels.instance }})"          description: "A Prometheus VirtualMahine target has disappeared. 
An exporter might be crashed.\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"      - alert: HostUnusualDiskWriteLatency        expr: rate(node_disk_write_time_seconds_total{machinetype="virtualmachine"}[1m]) / rate(node_disk_writes_completed_total{machinetype="virtualmachine"}[1m]) > 100        for: 5m        labels:          alert_level: middle          alert_type: disk          alert_source_type: kvm        annotations:          summary: "Host unusual disk write latency (instance {{ $labels.instance }})"          description: "Disk latency is growing (write operations > 100ms)\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"      - alert: HostHighCpuLoad        expr: 100 - (avg by(instance) (irate(node_cpu_seconds_total{mode="idle",machinetype="virtualmachine"}[5m])) * 100) > 80        for: 5m        labels:          alert_level: middle          alert_type: cpu          alert_source_type: kvm        annotations:          summary: "Host high CPU load (instance {{ $labels.instance }})"          description: "CPU load is > 80%\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"      - alert: HostSwapIsFillingUp        expr: (1 - (node_memory_SwapFree_bytes{machinetype="virtualmachine"} / node_memory_SwapTotal_bytes{machinetype="virtualmachine"})) * 100 > 80        for: 5m        labels:          alert_level: middle          alert_type: mem          alert_source_type: kvm        annotations:          summary: "Host swap is filling up (instance {{ $labels.instance }})"          description: "Swap is filling up (>80%)\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"      - alert: HostUnusualNetworkThroughputIn        expr: sum by (instance) (irate(node_network_receive_bytes_total{machinetype="virtualmachine"}[2m])) / 1024 / 1024 > 100        for: 5m        labels:          alert_level: middle          alert_type: network          alert_source_type: kvm        annotations:          summary: "Host unusual network throughput in (instance {{ $labels.instance }})"          description: "Host network interfaces are probably receiving too much data (> 100 MB/s)\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"      - alert: HostOutOfMemory        expr: node_memory_MemAvailable_bytes{machinetype="virtualmachine"} / node_memory_MemTotal_bytes{machinetype="virtualmachine"} * 100 < 10        for: 5m        labels:          alert_level: middle          alert_type: mem          alert_source_type: kvm        annotations:          summary: "Host out of memory (instance {{ $labels.instance }})"          description: "Node memory is filling up (< 10% left)\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"          description: "The node is under heavy memory pressure. High rate of major page faults\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"  node_exporter.yaml: |-    groups:    - name: machine      rules:      - alert: MachineDown        expr: up{machinetype="physicalmachine"} == 0        for: 2m        labels:          alert_level: high          alert_type: state          alert_source_type: machine        annotations:          summary: "Prometheus Machine target missing (instance {{ $labels.instance }})"          description: "A Prometheus Mahine target has disappeared. 
An exporter might be crashed.\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"      - alert: HostUnusualDiskWriteLatency        expr: rate(node_disk_write_time_seconds_total{machinetype="physicalmachine"}[1m]) / rate(node_disk_writes_completed_total{machinetype="physicalmachine"}[1m]) > 100        for: 5m        labels:          alert_level: middle          alert_type: disk          alert_source_type: machine        annotations:          summary: "Host unusual disk write latency (instance {{ $labels.instance }})"          description: "Disk latency is growing (write operations > 100ms)\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"      - alert: HostHighCpuLoad        expr: 100 - (avg by(instance) (irate(node_cpu_seconds_total{mode="idle",machinetype="physicalmachine"}[5m])) * 100) > 80        for: 5m        labels:          alert_level: middle          alert_type: cpu          alert_source_type: machine        annotations:          summary: "Host high CPU load (instance {{ $labels.instance }})"          description: "CPU load is > 80%\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"      - alert: HostSwapIsFillingUp        expr: (1 - (node_memory_SwapFree_bytes{machinetype="physicalmachine"} / node_memory_SwapTotal_bytes{machinetype="physicalmachine"})) * 100 > 80        for: 5m        labels:          alert_level: middle          alert_type: state          alert_source_type: machine        annotations:          summary: "Host swap is filling up (instance {{ $labels.instance }})"          description: "Swap is filling up (>80%)\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"      - alert: HostUnusualNetworkThroughputIn        expr: sum by (instance) (irate(node_network_receive_bytes_total{machinetype="physicalmachine"}[2m])) / 1024 / 1024 > 100        for: 5m        labels:          alert_level: middle          alert_type: network          alert_source_type: machine        annotations:          summary: "Host unusual network throughput in (instance {{ $labels.instance }})"          description: "Host network interfaces are probably receiving too much data (> 100 MB/s)\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"      - alert: HostOutOfMemory        expr: node_memory_MemAvailable_bytes{machinetype="physicalmachine"} / node_memory_MemTotal_bytes{machinetype="physicalmachine"} * 100 < 10        for: 5m        labels:          alert_level: middle          alert_type: mem          alert_source_type: machine        annotations:          summary: "Host out of memory (instance {{ $labels.instance }})"          description: "Node memory is filling up (< 10% left)\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"          description: "The node is under heavy memory pressure. 
High rate of major page faults\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"      - alert: HostOutOfDiskSpace        expr: (node_filesystem_avail_bytes{machinetype="physicalmachine"} * 100) / node_filesystem_size_bytesi{machinetype="physicalmachine"} < 10        for: 5m        labels:          alert_level: middle          alert_type: disk          alert_source_type: machine        annotations:          summary: "Host out of disk space (instance {{ $labels.instance }})"          description: "Disk is almost full (< 10% left)\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"      - alert: HostDiskWillFillIn4Hours        expr: predict_linear(node_filesystem_free_bytes{fstype!~"tmpfs",machinetype="physicalmachine"}[1h], 4 * 3600) < 0        for: 5m        labels:          alert_level: middle          alert_type: disk          alert_source_type: machine        annotations:          summary: "Host disk will fill in 4 hours (instance {{ $labels.instance }})"          description: "Disk will fill in 4 hours at current write rate\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"      - alert: HostOutOfInodes        expr: node_filesystem_files_free{mountpoint ="/rootfs",machinetype="physicalmachine"} / node_filesystem_files{mountpoint ="/rootfs",machinetype="physicalmachine"} * 100 < 10        for: 5m        labels:          alert_level: middle          alert_type: disk          alert_source_type: machine        annotations:          summary: "Host out of inodes (instance {{ $labels.instance }})"          description: "Disk is almost running out of available inodes (< 10% left)\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"      - alert: HostOomKillDetected        expr: increase(node_vmstat_oom_kill{machinetype="physicalmachine"}[5m]) > 0        for: 5m        labels:          alert_level: middle          alert_type: state          alert_source_type: machine        annotations:          summary: "Host OOM kill detected (instance {{ $labels.instance }})"          description: "OOM kill detected\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"      - alert: HostNetworkTransmitErrors        expr: increase(node_network_transmit_errs_total{machinetype="physicalmachine"}[5m]) > 0        for: 5m        labels:          alert_level: middle          alert_type: network          alert_source_type: machine        annotations:          summary: "Host Network Transmit Errors (instance {{ $labels.instance }})"          description: '{{ $labels.instance }} interface {{ $labels.device }} has encountered {{ printf "%.0f" $value }} transmit errors in the last five minutes.\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}'kind: ConfigMapmetadata:  name: vmalert-ruler  namespace: vm

For commonly used rule configurations, see awesome-prometheus-alerts.

apiVersion: v1
kind: Service
metadata:
  labels:
    app: vmalert
  name: vmalert
  namespace: vm
spec:
  ports:
  - name: vmalert
    port: 8080
    protocol: TCP
    targetPort: 8080
  type: NodePort
  selector:
    app: vmalert
---
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: vmalert
  name: vmalert
  namespace: vm
spec:
  replicas: 1
  selector:
    matchLabels:
      app: vmalert
  template:
    metadata:
      labels:
        app: vmalert
    spec:
      containers:
      - args:
        - "-rule=/etc/ruler/*.yaml"
        - "-datasource.url=http://vm-victoria-metrics-cluster-vmselect.vm.svc.cluster.local:8481/select/0/prometheus"
        - "-notifier.url=http://alertmanager.vm.svc.cluster.local:9093"
        - "-evaluationInterval=60s"
        - "-httpListenAddr=0.0.0.0:8080"
        command:
        - "/vmalert-prod"
        image: victoriametrics/vmalert:v1.59.0
        name: vmalert
        imagePullPolicy: Always
        volumeMounts:
        - mountPath: "/etc/ruler/"
          name: ruler
          readOnly: true
      volumes:
      - configMap:
          name: vmalert-ruler
        name: ruler
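After the rules ConfigMap and the Deployment are applied, vmalert's Prometheus-compatible HTTP API can be used to confirm that the rule groups were loaded and to list firing alerts; a quick check, assuming the NodePort service above:

kubectl -n vm get svc vmalert
curl -s http://<node-ip>:<nodeport>/api/v1/rules | head
curl -s http://<node-ip>:<nodeport>/api/v1/alerts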