<article class="article fmt article-content"><p>Prometheus on Kubernetes</p>
<p>Collecting data: run node-exporter on every node as a DaemonSet. It uses the host network, so each node exposes its metrics on port 9100.</p>
<p>vi node-exporter-ds.yml</p>
<pre><code>apiVersion: extensions/v1beta1
kind: DaemonSet
metadata:
  name: node-exporter
  labels:
    app: node-exporter
spec:
  template:
    metadata:
      labels:
        app: node-exporter
    spec:
      hostNetwork: true
      containers:
      - image: prom/node-exporter
        name: node-exporter
        ports:
        - containerPort: 9100
        volumeMounts:
        - mountPath: "/etc/localtime"
          name: timezone
      volumes:
      - name: timezone
        hostPath:
          path: /etc/localtime
</code></pre>
<p>Storage: create a 10Gi PersistentVolume backed by NFS for long-term data.</p>
<p>vi prometheus-pv.yaml</p>
<pre><code>apiVersion: v1
kind: PersistentVolume
metadata:
  name: gwj-pv-prometheus
  labels:
    app: gwj-pv
spec:
  capacity:
    storage: 10Gi
  volumeMode: Filesystem
  accessModes:
  - ReadWriteMany
  persistentVolumeReclaimPolicy: Recycle
  storageClassName: slow
  mountOptions:
  - hard
  - nfsvers=4.1
  nfs:
    path: /storage/gwj-prometheus
    server: 10.1.99.1
</code></pre>
<p>PersistentVolumeClaim: claim 5Gi from the PV just created.</p>
<p>vi prometheus-pvc.yaml</p>
<pre><code>apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: gwj-prometheus-pvc
  namespace: gwj
spec:
  accessModes:
  - ReadWriteMany
  volumeMode: Filesystem
  resources:
    requests:
      storage: 5Gi
  selector:
    matchLabels:
      app: gwj-pv
  storageClassName: slow
</code></pre>
<p>Set up the RBAC permissions Prometheus needs to discover and scrape cluster resources.</p>
<p>vi prometheus-rbac.yml</p>
<pre><code>apiVersion: rbac.authorization.k8s.io/v1beta1
kind: ClusterRole
metadata:
  name: gwj-prometheus-clusterrole
rules:
- apiGroups: [""]
  resources:
  - nodes
  - nodes/proxy
  - services
  - endpoints
  - pods
  verbs: ["get", "list", "watch"]
- apiGroups:
  - extensions
  resources:
  - ingresses
  verbs: ["get", "list", "watch"]
- nonResourceURLs: ["/metrics"]
  verbs: ["get"]
---
apiVersion: v1
kind: ServiceAccount
metadata:
  namespace: gwj
  name: gwj-prometheus
---
apiVersion: rbac.authorization.k8s.io/v1beta1
kind: ClusterRoleBinding
metadata:
  name: gwj-prometheus-rolebinding
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: gwj-prometheus-clusterrole
subjects:
- kind: ServiceAccount
  name: gwj-prometheus
  namespace: gwj
</code></pre>
<p>kubectl apply -f prometheus-rbac.yml</p>
<p>clusterrole.rbac.authorization.k8s.io/gwj-prometheus-clusterrole created</p>
<p>serviceaccount/gwj-prometheus created</p>
<p>clusterrolebinding.rbac.authorization.k8s.io/gwj-prometheus-rolebinding created</p>
<p>Create the Prometheus configuration (prometheus.yml, plus the alerting rules in rules.yml) as a ConfigMap.</p>
<p>vi prometheus-cm.yml</p>
<pre><code>apiVersion: v1
kind: ConfigMap
metadata:
  name: gwj-prometheus-cm
  namespace: gwj
data:
  prometheus.yml: |
    rule_files:
    - /etc/prometheus/rules.yml
    alerting:
      alertmanagers:
      - static_configs:
        - targets: ["gwj-alertmanger-svc:80"]
    global:
      scrape_interval: 10s
      scrape_timeout: 10s
      evaluation_interval: 10s
    scrape_configs:
    - job_name: 'kubernetes-nodes'
      scheme: https
      tls_config:
        ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
      bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
      kubernetes_sd_configs:
      - role: node
      relabel_configs:
      - action: labelmap
        regex: __meta_kubernetes_node_label_(.+)
      - source_labels: [__meta_kubernetes_node_name]
        regex: (.+)
        target_label: __metrics_path__
        replacement: /api/v1/nodes/${1}/proxy/metrics
      - target_label: __address__
        replacement: kubernetes.default.svc:443
    - job_name: 'kubernetes-node-exporter'
      tls_config:
        ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
      bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
      kubernetes_sd_configs:
      - role: node
      relabel_configs:
      - action: labelmap
        regex: __meta_kubernetes_node_label_(.+)
      - source_labels: [__meta_kubernetes_role]
        action: replace
        target_label: kubernetes_role
      - source_labels: [__address__]
        regex: '(.*):10250'
        replacement: '${1}:9100'
        target_label: __address__
    - job_name: 'kubernetes-pods'
      kubernetes_sd_configs:
      - role: pod
      relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        target_label: __address__
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
      - action: labelmap
        regex: __meta_kubernetes_pod_label_(.+)
      - source_labels: [__meta_kubernetes_namespace]
        action: replace
        target_label: kubernetes_namespace
      - source_labels: [__meta_kubernetes_pod_name]
        action: replace
        target_label: kubernetes_pod_name
    - job_name: 'kubernetes-cadvisor'
      scheme: https
      tls_config:
        ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
      bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
      kubernetes_sd_configs:
      - role: node
      relabel_configs:
      - action: labelmap
        regex: __meta_kubernetes_node_label_(.+)
      - target_label: __address__
        replacement: kubernetes.default.svc:443
      - source_labels: [__meta_kubernetes_node_name]
        regex: (.+)
        target_label: __metrics_path__
        replacement: /api/v1/nodes/${1}/proxy/metrics/cadvisor
    - job_name: 'kubernetes-service-endpoints'
      kubernetes_sd_configs:
      - role: endpoints
      relabel_configs:
      - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scheme]
        action: replace
        target_label: __scheme__
        regex: (https?)
      - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)
      - source_labels: [__address__, __meta_kubernetes_service_annotation_prometheus_io_port]
        action: replace
        target_label: __address__
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
      - action: labelmap
        regex: __meta_kubernetes_service_label_(.+)
      - source_labels: [__meta_kubernetes_namespace]
        action: replace
        target_label: kubernetes_namespace
      - source_labels: [__meta_kubernetes_service_name]
        action: replace
        target_label: kubernetes_name
  rules.yml: |
    groups:
    - name: kubernetes_rules
      rules:
      - alert: InstanceDown
        expr: up{job="kubernetes-node-exporter"} == 0
        for: 5m
        labels:
          severity: page
        annotations:
          summary: "Instance {{ $labels.instance }} down"
          description: "{{ $labels.instance }} of job {{ $labels.job }} has been down for more than 5 minutes."
      - alert: APIHighRequestLatency
        expr: api_http_request_latencies_second{quantile="0.5"} > 1
        for: 10m
        annotations:
          summary: "High request latency on {{ $labels.instance }}"
          description: "{{ $labels.instance }} has a median request latency above 1s (current value: {{ $value }}s)"
      - alert: StatefulSetReplicasMismatch
        annotations:
          summary: "Replicas mismatch"
          description: StatefulSet {{ $labels.namespace }}/{{ $labels.statefulset }} has not matched the expected number of replicas for longer than 3 minutes.
        expr: label_join(kube_statefulset_status_replicas_ready != kube_statefulset_replicas, "instance", "/", "namespace", "statefulset")
        for: 3m
        labels:
          severity: critical
      - alert: PodFrequentlyRestarting
        expr: increase(kube_pod_container_status_restarts_total[1h]) > 5
        for: 5m
        labels:
          severity: warning
        annotations:
          description: Pod {{ $labels.namespaces }}/{{ $labels.pod }} was restarted {{ $value }} times within the last hour
          summary: Pod is restarting frequently
      - alert: DeploymentReplicasNotUpdated
        expr: ((kube_deployment_status_replicas_updated != kube_deployment_spec_replicas)
                or (kube_deployment_status_replicas_available != kube_deployment_spec_replicas))
              unless (kube_deployment_spec_paused == 1)
        for: 5m
        labels:
          severity: critical
        annotations:
          description: Replicas are not updated and available for deployment {{ $labels.namespace }}/{{ $labels.deployment }}
          summary: Deployment replicas are outdated
      - alert: DaemonSetRolloutStuck
        expr: kube_daemonset_status_number_ready / kube_daemonset_status_desired_number_scheduled * 100 &lt; 100
        for: 5m
        labels:
          severity: critical
        annotations:
          description: Only {{ $value }}% of desired pods scheduled and ready for daemonset {{ $labels.namespace }}/{{ $labels.daemonset }}
          summary: DaemonSet is missing pods
      - alert: DaemonSetsNotScheduled
        expr: kube_daemonset_status_desired_number_scheduled - kube_daemonset_status_current_number_scheduled > 0
        for: 10m
        labels:
          severity: warning
        annotations:
          description: '{{ $value }} Pods of DaemonSet {{ $labels.namespace }}/{{ $labels.daemonset }} are not scheduled.'
          summary: Daemonsets are not scheduled correctly
      - alert: DaemonSetsMissScheduled
        expr: kube_daemonset_status_number_misscheduled > 0
        for: 10m
        labels:
          severity: warning
        annotations:
          description: '{{ $value }} Pods of DaemonSet {{ $labels.namespace }}/{{ $labels.daemonset }} are running where they are not supposed to run.'
          summary: Daemonsets are not scheduled correctly
      - alert: Node_Boot_Time
        expr: (node_time_seconds - node_boot_time_seconds) &lt;= 150
        for: 15s
        annotations:
          summary: "Instance {{ $labels.instance }} has just rebooted; uptime is less than 150s"
      - alert: Available_Percent
        expr: node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes &lt;= 0.2
        for: 15s
        annotations:
          summary: "Instance {{ $labels.instance }} has less than 20% memory available"
      - alert: FD_Used_Percent
        expr: (node_filefd_allocated / node_filefd_maximum) >= 0.8
        for: 15s
        annotations:
          summary: "Instance {{ $labels.instance }} has more than 80% of file descriptors in use"
</code></pre>
<p>The ConfigMap above sends alerts to gwj-alertmanger-svc, so create Alertmanager next.</p>
<p>vi alertmanger.yml</p>
<pre><code>---
kind: Service
apiVersion: v1
metadata:
  name: gwj-alertmanger-svc
  namespace: gwj
spec:
  selector:
    app: gwj-alert-pod
  ports:
  - protocol: TCP
    port: 80
    targetPort: 9093
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: gwj-alert-sts
  namespace: gwj
  labels:
    app: gwj-alert-sts
spec:
  replicas: 1
  serviceName: gwj-alertmanger-svc
  selector:
    matchLabels:
      app: gwj-alert-pod
  template:
    metadata:
      labels:
        app: gwj-alert-pod
    spec:
      containers:
      - image: prom/alertmanager:v0.14.0
        name: gwj-alert-pod
        ports:
        - containerPort: 9093
          protocol: TCP
        volumeMounts:
        - mountPath: "/etc/localtime"
          name: timezone
      volumes:
      - name: timezone
        hostPath:
          path: /etc/localtime
</code></pre>
<p>kubectl apply -f alertmanger.yml</p>
<p>service/gwj-alertmanger-svc created</p>
<p>statefulset.apps/gwj-alert-sts created</p>
<p>Create a StatefulSet to run Prometheus itself. The pod runs under the gwj-prometheus ServiceAccount and mounts the PVC gwj-prometheus-pvc at /prometheus (data) and the ConfigMap gwj-prometheus-cm at /etc/prometheus/ (configuration and rules).</p>
<p>vi prometheus-sts.yml</p>
<pre><code>---
kind: Service
apiVersion: v1
metadata:
  name: gwj-prometheus-svc
  namespace: gwj
  labels:
    app: gwj-prometheus-svc
spec:
  ports:
  - port: 80
    targetPort: 9090
  selector:
    app: gwj-prometheus-pod
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: gwj-prometheus-sts
  namespace: gwj
  labels:
    app: gwj-prometheus-sts
spec:
  replicas: 1
  serviceName: gwj-prometheus-svc
  selector:
    matchLabels:
      app: gwj-prometheus-pod
  template:
    metadata:
      labels:
        app: gwj-prometheus-pod
    spec:
      containers:
      - image: prom/prometheus:v2.9.2
        name: gwj-prometheus-pod
        ports:
        - containerPort: 9090
          protocol: TCP
        volumeMounts:
        - mountPath: "/prometheus"
          name: data
        - mountPath: "/etc/prometheus/"
          name: config-volume
        - mountPath: "/etc/localtime"
          name: timezone
        resources:
          requests:
            cpu: 100m
            memory: 100Mi
          limits:
            cpu: 500m
            memory: 2000Mi
      serviceAccountName: gwj-prometheus
      volumes:
      - name: data
        persistentVolumeClaim:
          claimName: gwj-prometheus-pvc
      - name: config-volume
        configMap:
          name: gwj-prometheus-cm
      - name: timezone
        hostPath:
          path: /etc/localtime
</code></pre>
<p>kubectl apply -f prometheus-sts.yml</p>
<p>service/gwj-prometheus-svc created</p>
<p>statefulset.apps/gwj-prometheus-sts created</p>
<p>Create an Ingress that routes each hostname to its Service.</p>
<p>vi prometheus-ingress.yml</p>
<pre><code>---
apiVersion: extensions/v1beta1
kind: Ingress
metadata:
  name: gwj-ingress-prometheus
  namespace: gwj
spec:
  rules:
  - host: gwj.syncbug.com
    http:
      paths:
      - path: /
        backend:
          serviceName: gwj-prometheus-svc
          servicePort: 80
  - host: gwj-alert.syncbug.com
    http:
      paths:
      - path: /
        backend:
          serviceName: gwj-alertmanger-svc
          servicePort: 80
</code></pre>
<p>kubectl apply -f prometheus-ingress.yml</p>
<p>ingress.extensions/gwj-ingress-prometheus created</p>
<p>Visit the corresponding domains.</p>
<p>gwj.syncbug.com</p>
<p>Check that the scrape targets are correct: http://gwj.syncbug.com/targets</p>
<p>Check that the configuration was loaded correctly: http://gwj.syncbug.com/config</p>
<p>gwj-alert.syncbug.com serves Alertmanager.</p>
<p>===grafana</p>
<p>Create a 2Gi NFS-backed PV for Grafana and claim 1Gi from it.</p>
<p>vi grafana-pv.yaml</p>
<pre><code>apiVersion: v1
kind: PersistentVolume
metadata:
  name: gwj-pv-grafana
  labels:
    app: gwj-pv-gra
spec:
  capacity:
    storage: 2Gi
  volumeMode: Filesystem
  accessModes:
  - ReadWriteMany
  persistentVolumeReclaimPolicy: Recycle
  storageClassName: slow
  mountOptions:
  - hard
  - nfsvers=4.1
  nfs:
    path: /storage/gwj-grafana
    server: 10.1.99.1
</code></pre>
<p>vi grafana-pvc.yaml</p>
<pre><code>apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: gwj-grafana-pvc
  namespace: gwj
spec:
  accessModes:
  - ReadWriteMany
  volumeMode: Filesystem
  resources:
    requests:
      storage: 1Gi
  selector:
    matchLabels:
      app: gwj-pv-gra
  storageClassName: slow
</code></pre>
<p>vi grafana-deployment.yaml</p>
<pre><code>apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  labels:
    name: grafana
  name: grafana
  namespace: gwj
spec:
  replicas: 1
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      app: grafana
  template:
    metadata:
      labels:
        app: grafana
      name: grafana
    spec:
      containers:
      - env:
        - name: GF_PATHS_DATA
          value: /var/lib/grafana/
        - name: GF_PATHS_PLUGINS
          value: /var/lib/grafana/plugins
        image: grafana/grafana:6.2.4
        imagePullPolicy: IfNotPresent
        name: grafana
        ports:
        - containerPort: 3000
          name: grafana
          protocol: TCP
        volumeMounts:
        - mountPath: /var/lib/grafana/
          name: data
        - mountPath: /etc/localtime
          name: localtime
      dnsPolicy: ClusterFirst
      restartPolicy: Always
      volumes:
      - name: data
        persistentVolumeClaim:
          claimName: gwj-grafana-pvc
      - name: localtime
        hostPath:
          path: /etc/localtime
</code></pre>
<p>vi grafana-ingress.yaml</p>
<pre><code>---
apiVersion: extensions/v1beta1
kind: Ingress
metadata:
  name: gwj-ingress-grafana
  namespace: gwj
spec:
  rules:
  - host: gwj-grafana.syncbug.com
    http:
      paths:
      - path: /
        backend:
          serviceName: gwj-grafana-svc
          servicePort: 80
---
kind: Service
apiVersion: v1
metadata:
  name: gwj-grafana-svc
  namespace: gwj
spec:
  selector:
    app: grafana
  ports:
  - protocol: TCP
    port: 80
    targetPort: 3000
</code></pre>
<p>Open Grafana at gwj-grafana.syncbug.com.</p>
<p>Default login: admin / admin</p>
<p>Add the Prometheus data source: http://gwj-prometheus-svc:80</p>
<p>Import a dashboard template.</p></article>
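<p>The relabel rules in the ConfigMap are easy to misread once the double underscores get mangled by rendering. Prometheus fully anchors relabel regexes and joins multiple source_labels with ";" before matching. A small standalone Python sketch (the sample addresses and labels are made up, not real cluster data) mirrors what three of the rules do:</p>

```python
import re

# Rule from 'kubernetes-node-exporter': regex (.*):10250, replacement ${1}:9100
# rewrites the discovered kubelet address to the node-exporter port.
m = re.fullmatch(r"(.*):10250", "10.1.99.11:10250")
node_exporter_addr = f"{m.group(1)}:9100"   # "10.1.99.11:9100"

# Rule from 'kubernetes-pods': __address__ and the prometheus.io/port
# annotation are joined with ';' before the regex is applied.
m = re.fullmatch(r"([^:]+)(?::\d+)?;(\d+)", "10.244.1.5:8080;9102")
scrape_addr = f"{m.group(1)}:{m.group(2)}"  # "10.244.1.5:9102"

# action: labelmap with regex __meta_kubernetes_node_label_(.+)
# copies each matching discovery label to a plain label named by group 1.
meta = {"__meta_kubernetes_node_label_zone": "cn-north"}
mapped = {re.fullmatch(r"__meta_kubernetes_node_label_(.+)", k).group(1): v
          for k, v in meta.items()}          # {"zone": "cn-north"}
```

The same joined-regex form appears again in the 'kubernetes-service-endpoints' job, so one mental model covers both.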
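<p>The StatefulSetReplicasMismatch rule uses PromQL's label_join to stamp a readable "namespace/statefulset" value into the instance label of every firing sample. A minimal Python model of that one function, assuming a single sample's label set:</p>

```python
def label_join(labels, dst, sep, *src):
    # PromQL label_join: write sep.join(values of the src labels) into dst,
    # leaving all other labels untouched.
    out = dict(labels)
    out[dst] = sep.join(labels[s] for s in src)
    return out

sample = {"namespace": "gwj", "statefulset": "gwj-prometheus-sts"}
joined = label_join(sample, "instance", "/", "namespace", "statefulset")
# joined["instance"] == "gwj/gwj-prometheus-sts"
```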
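<p>The three node-level alerts at the end of rules.yml (Node_Boot_Time, Available_Percent, FD_Used_Percent) are plain threshold checks on node-exporter gauges. Evaluated by hand with hypothetical sample values (the real values come from the scraped /metrics endpoints):</p>

```python
# Hypothetical sample values standing in for node-exporter metrics.
node_time_seconds = 1_700_000_100.0
node_boot_time_seconds = 1_700_000_000.0
mem_available_bytes = 1.5 * 2**30
mem_total_bytes = 8 * 2**30
fd_allocated = 850_000
fd_maximum = 1_000_000

rebooted_recently = (node_time_seconds - node_boot_time_seconds) <= 150  # uptime 100s
low_available_memory = mem_available_bytes / mem_total_bytes <= 0.2      # ~18.75% free
fd_pressure = fd_allocated / fd_maximum >= 0.8                           # 85% in use
```

With these values all three conditions hold, so each alert would fire once its "for: 15s" window elapses.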