Background:

I recently brought up a TKE 1.20.6 cluster online. Tencent Cloud has a native managed Prometheus monitoring service, which I enabled and played with for a while, but I never really figured it out and the documentation is incomplete. So I decided to just build a Prometheus Operator stack by hand.
The overall process follows my earlier post, Kubernetes 1.20.5 installing Prometheus-Oprator. Below I cover what is different and what needs attention.

Process and points to note:

1. Prerequisite: repeat the earlier steps

Steps 1.1-1.4 from that post can be followed as-is without any problems.

2. Add kubeControllerManager and kubeScheduler monitoring

Visiting the Prometheus page, just like in the previous versions there is still no monitoring for kube-scheduler and kube-controller-manager. What I didn't understand is why kube-apiserver only has two targets, and the two apiserver addresses in the 169.x range also surprised me a bit...

First I ran netstat on a master node and found that the TKE components are all listening on IPv6 (wildcard) addresses rather than 127.0.0.1, so I skipped modifying the kube-controller-manager and kube-scheduler config files.

netstat -ntlp
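To cut the noise, filtering the netstat output by the well-known ports (10251/10257/10259) is enough; something like:

netstat -ntlp | grep -E '10251|10257|10259'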


So I did not touch the kube-controller-manager.yaml or kube-scheduler.yaml config files here. While I was at it I glanced at the /etc/kubernetes/manifests directory and... what? There's a cilium package in there too? Does TKE 1.20.6 also use Cilium?

Deploy the Services for kube-controller-manager and kube-scheduler:

cat <<EOF > kube-controller-manager-scheduler.yml
apiVersion: v1
kind: Service
metadata:
  namespace: kube-system
  name: kube-controller-manager
  labels:
    app.kubernetes.io/name: kube-controller-manager
spec:
  selector:
    component: kube-controller-manager
  type: ClusterIP
  clusterIP: None
  ports:
  - name: https-metrics
    port: 10257
    targetPort: 10257
    protocol: TCP
---
apiVersion: v1
kind: Service
metadata:
  namespace: kube-system
  name: kube-scheduler
  labels:
    app.kubernetes.io/name: kube-scheduler
spec:
  selector:
    component: kube-scheduler
  type: ClusterIP
  clusterIP: None
  ports:
  - name: https-metrics
    port: 10259
    targetPort: 10259
    protocol: TCP
EOF
kubectl apply -f kube-controller-manager-scheduler.yml
kubectl get svc -n kube-system


Create the matching Endpoints:

cat <<EOF > kube-ep.yml
apiVersion: v1
kind: Endpoints
metadata:
  labels:
    k8s-app: kube-controller-manager
  name: kube-controller-manager
  namespace: kube-system
subsets:
- addresses:
  - ip: 10.0.4.25
  - ip: 10.0.4.24
  - ip: 10.0.4.38
  ports:
  - name: https-metrics
    port: 10257
    protocol: TCP
---
apiVersion: v1
kind: Endpoints
metadata:
  labels:
    k8s-app: kube-scheduler
  name: kube-scheduler
  namespace: kube-system
subsets:
- addresses:
  - ip: 10.0.4.25
  - ip: 10.0.4.24
  - ip: 10.0.4.38
  ports:
  - name: https-metrics
    port: 10259
    protocol: TCP
EOF
kubectl apply -f kube-ep.yml
kubectl get ep -n kube-system

Log in to Prometheus to verify:

Why? kube-controller-manager came up, but the kube-scheduler targets are all still down?
Let's start troubleshooting.
In the manifests directory (pay close attention at this step: the matchLabels have changed in the newer version):

grep -A2 -B2  selector kubernetes-serviceMonitor*


Take a look at the labels on the pods in kube-system:

kubectl get pods -n kube-system --show-labels


No idea why, but the TKE core Kubernetes components carry none of these labels. I checked a cluster I built myself for comparison, taking the scheduler as an example:

kubectl get pods kube-scheduler-k8s-master-01 -n kube-system --show-labels


Everyone keeps talking about cloud native; could these labels at least be kept in place? Otherwise troubleshooting is really hard for an ordinary beginner. Since the labels are missing, can I just add them by hand?

kubectl label pod kube-scheduler-ap-shanghai-k8s-master-1 kube-scheduler-ap-shanghai-k8s-master-2 kube-scheduler-ap-shanghai-k8s-master-3 -n kube-system app.kubernetes.io/name=kube-scheduler
kubectl label pod kube-scheduler-ap-shanghai-k8s-master-1 kube-scheduler-ap-shanghai-k8s-master-2 kube-scheduler-ap-shanghai-k8s-master-3 -n kube-system component=kube-scheduler
kubectl label pod kube-scheduler-ap-shanghai-k8s-master-1 kube-scheduler-ap-shanghai-k8s-master-2 kube-scheduler-ap-shanghai-k8s-master-3 -n kube-system k8s-app=kube-scheduler
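A quick way to confirm the labels actually landed:

kubectl get pods -n kube-system -l app.kubernetes.io/name=kube-scheduler --show-labels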



The Prometheus page still looks the same:

Ouch. At this point I looked back at the netstat output:

Hmm? It still has the insecure port 10251 open by default? Then let me switch to 10251 and try. (Although, if I remember right, upstream already defaulted to the secure port 10259 from 1.17 or some version around there; I have no idea why TKE still keeps this port open.)
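Before rewiring anything, a quick sanity check that 10251 really serves metrics over plain HTTP (a sketch; 10.0.4.25 is one of the master IPs used in the Endpoints above):

# The legacy insecure port needs no TLS or token; if this prints metric lines,
# the http-metrics Service/Endpoints approach below should scrape fine.
curl -s http://10.0.4.25:10251/metrics | head -n 5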
Regenerate the scheduler Service and Endpoints:

cat <<EOF > kube-scheduler.yaml
apiVersion: v1
kind: Service
metadata:
  namespace: kube-system
  name: kube-scheduler
  labels:
    app.kubernetes.io/name: kube-scheduler
spec:
  selector:
    app.kubernetes.io/name: kube-scheduler
  type: ClusterIP
  clusterIP: None
  ports:
  - name: http-metrics
    port: 10251
    targetPort: 10251
    protocol: TCP
---
apiVersion: v1
kind: Endpoints
metadata:
  labels:
    k8s-app: kube-scheduler
  name: kube-scheduler
  namespace: kube-system
subsets:
- addresses:
  - ip: 10.0.4.25
  - ip: 10.0.4.24
  - ip: 10.0.4.38
  ports:
  - name: http-metrics
    port: 10251
    protocol: TCP
EOF
kubectl apply -f kube-scheduler.yaml

Rebuild the scheduler's serviceMonitorKubeScheduler:

cat <<EOF > kubernetes-serviceMonitorKubeScheduler.yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  labels:
    app.kubernetes.io/name: kube-scheduler
  name: kube-scheduler
  namespace: monitoring
spec:
  endpoints:
  - bearerTokenFile: /var/run/secrets/kubernetes.io/serviceaccount/token
    interval: 30s
    port: http-metrics
    scheme: http
    tlsConfig:
      insecureSkipVerify: true
  jobLabel: app.kubernetes.io/name
  namespaceSelector:
    matchNames:
    - kube-system
  selector:
    matchLabels:
      app.kubernetes.io/name: kube-scheduler
EOF
kubectl apply -f kubernetes-serviceMonitorKubeScheduler.yaml


Call it a workaround... it'll do for now.

3. etcd monitoring

TKE's etcd certificates live in a different location and have different names from a vanilla cluster, as shown below:

root@ap-shanghai-k8s-master-1:/etc/etcd/certs# ls /etc/etcd/certs
etcd-cluster.crt  etcd-node.crt  etcd-node.key
root@ap-shanghai-k8s-master-1:/etc/etcd/certs# kubectl -n monitoring create secret generic etcd-certs --from-file=/etc/etcd/certs/etcd-node.crt --from-file=/etc/etcd/certs/etcd-node.key --from-file=/etc/etcd/certs/etcd-cluster.crt
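A quick look at the secret to confirm all three files made it in:

kubectl describe secret etcd-certs -n monitoring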


Modify prometheus-prometheus.yaml to add the secret:

  secrets:
  - etcd-certs

kubectl apply -f prometheus-prometheus.yaml
kubectl exec -it prometheus-k8s-0 /bin/sh -n monitoring
ls /etc/prometheus/secrets/etcd-certs/

cat <<EOF > kube-ep-etcd.yml
apiVersion: v1
kind: Service
metadata:
  name: etcd-k8s
  namespace: kube-system
  labels:
    k8s-app: etcd
spec:
  type: ClusterIP
  clusterIP: None
  ports:
  - name: etcd
    port: 2379
    protocol: TCP
---
apiVersion: v1
kind: Endpoints
metadata:
  labels:
    k8s-app: etcd
  name: etcd-k8s
  namespace: kube-system
subsets:
- addresses:
  - ip: 10.0.4.25
  - ip: 10.0.4.24
  - ip: 10.0.4.38
  ports:
  - name: etcd
    port: 2379
    protocol: TCP
EOF
kubectl apply -f kube-ep-etcd.yml
cat <<EOF > prometheus-serviceMonitorEtcd.yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: etcd-k8s
  namespace: monitoring
  labels:
    k8s-app: etcd
spec:
  jobLabel: k8s-app
  endpoints:
  - port: etcd
    interval: 30s
    scheme: https
    tlsConfig:
      caFile: /etc/prometheus/secrets/etcd-certs/etcd-cluster.crt
      certFile: /etc/prometheus/secrets/etcd-certs/etcd-node.crt
      keyFile: /etc/prometheus/secrets/etcd-certs/etcd-node.key
      insecureSkipVerify: true
  selector:
    matchLabels:
      k8s-app: etcd
  namespaceSelector:
    matchNames:
    - kube-system
EOF
kubectl apply -f prometheus-serviceMonitorEtcd.yaml
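Optionally, etcd's metrics endpoint can also be hit directly from a master with the same client certificates, just to rule out certificate problems (a sketch; the paths are the TKE ones shown above and 10.0.4.25 is one of the etcd member IPs):

# etcd requires client-cert auth on 2379; if this prints metrics, both the
# endpoint and the tlsConfig paths used by the ServiceMonitor are fine.
curl -s --cacert /etc/etcd/certs/etcd-cluster.crt \
     --cert /etc/etcd/certs/etcd-node.crt \
     --key /etc/etcd/certs/etcd-node.key \
     https://10.0.4.25:2379/metrics | head -n 5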


Verify in the Prometheus web UI:

And with that, etcd monitoring is done as well.

4. Finalize the Prometheus configuration

1. Add the auto-discovery (additional scrape) configuration

I just grabbed one from the internet:

cat <<EOF > prometheus-additional.yaml
- job_name: 'kubernetes-endpoints'
  kubernetes_sd_configs:
  - role: endpoints
  relabel_configs:
  - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scrape]
    action: keep
    regex: true
  - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scheme]
    action: replace
    target_label: __scheme__
    regex: (https?)
  - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_path]
    action: replace
    target_label: __metrics_path__
    regex: (.+)
  - source_labels: [__address__, __meta_kubernetes_service_annotation_prometheus_io_port]
    action: replace
    target_label: __address__
    regex: ([^:]+)(?::\d+)?;(\d+)
    replacement: $1:$2
  - action: labelmap
    regex: __meta_kubernetes_service_label_(.+)
  - source_labels: [__meta_kubernetes_namespace]
    action: replace
    target_label: kubernetes_namespace
  - source_labels: [__meta_kubernetes_service_name]
    action: replace
    target_label: kubernetes_name
  - source_labels: [__meta_kubernetes_pod_name]
    action: replace
    target_label: kubernetes_pod_name
EOF

Note: with cat <<EOF >, the replacement: $1:$2 line gets expanded by the shell and comes out as replacement: : — remember to fix it by hand!
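Alternatively, quoting the heredoc delimiter (<<'EOF') stops the shell from expanding $1 and $2 in the first place, so there is nothing to fix afterwards; a minimal sketch:

# With a quoted delimiter the heredoc body is written to the file literally.
cat <<'EOF' > prometheus-additional.yaml
    replacement: $1:$2
EOF
grep replacement prometheus-additional.yaml   # prints: replacement: $1:$2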

kubectl create secret generic additional-configs --from-file=prometheus-additional.yaml -n monitoring

2. Add storage, retention time, and the etcd secret

cat <<EOF > prometheus-prometheus.yaml
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  labels:
    app.kubernetes.io/component: prometheus
    app.kubernetes.io/name: prometheus
    app.kubernetes.io/part-of: kube-prometheus
    app.kubernetes.io/version: 2.28.1
    prometheus: k8s
  name: k8s
  namespace: monitoring
spec:
  alerting:
    alertmanagers:
    - apiVersion: v2
      name: alertmanager-main
      namespace: monitoring
      port: web
  enableFeatures: []
  externalLabels: {}
  image: quay.io/prometheus/prometheus:v2.28.1
  nodeSelector:
    kubernetes.io/os: linux
  podMetadata:
    labels:
      app.kubernetes.io/component: prometheus
      app.kubernetes.io/name: prometheus
      app.kubernetes.io/part-of: kube-prometheus
      app.kubernetes.io/version: 2.28.1
  podMonitorNamespaceSelector: {}
  podMonitorSelector: {}
  probeNamespaceSelector: {}
  probeSelector: {}
  replicas: 2
  resources:
    requests:
      memory: 400Mi
  secrets:
  - etcd-certs
  ruleNamespaceSelector: {}
  ruleSelector:
    matchLabels:
      prometheus: k8s
      role: alert-rules
  securityContext:
    fsGroup: 2000
    runAsNonRoot: true
    runAsUser: 1000
  additionalScrapeConfigs:
    name: additional-configs
    key: prometheus-additional.yaml
  serviceAccountName: prometheus-k8s
  retention: 60d
  serviceMonitorNamespaceSelector: {}
  serviceMonitorSelector: {}
  version: 2.28.1
  storage:
    volumeClaimTemplate:
      spec:
        storageClassName: cbs
        resources:
          requests:
            storage: 50Gi
EOF
kubectl apply -f prometheus-prometheus.yaml
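A few quick checks after applying (the PVC names come from the volumeClaimTemplate, so the grep is just a rough filter):

kubectl get prometheus k8s -n monitoring
kubectl get pvc -n monitoring | grep prometheus-k8s
kubectl get pods -n monitoring | grep prometheus-k8s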

3. The ClusterRole deserves a mention too

kubectl logs -f prometheus-k8s-0 prometheus -n monitoring


It's still the ClusterRole problem:

cat <<EOF > prometheus-clusterRole.yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  labels:
    app.kubernetes.io/component: prometheus
    app.kubernetes.io/name: prometheus
    app.kubernetes.io/part-of: kube-prometheus
    app.kubernetes.io/version: 2.28.1
  name: prometheus-k8s
rules:
- apiGroups:
  - ""
  resources:
  - nodes
  - services
  - endpoints
  - pods
  - nodes/proxy
  - nodes/metrics
  verbs:
  - get
  - list
  - watch
- apiGroups:
  - ""
  resources:
  - configmaps
  - nodes/metrics
  verbs:
  - get
- nonResourceURLs:
  - /metrics
  verbs:
  - get
EOF
kubectl apply -f prometheus-clusterRole.yaml
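To confirm the extra permissions actually took effect, impersonating the Prometheus service account works; for example:

kubectl auth can-i list nodes --as=system:serviceaccount:monitoring:prometheus-k8s
kubectl auth can-i get nodes/metrics --as=system:serviceaccount:monitoring:prometheus-k8s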

5. Add persistent storage to Grafana

Note: I actually don't have to install Grafana here at all; I'd rather use the Grafana built in Kubernetes 1.20.5 installing Prometheus-Oprator to aggregate everything, and the two clusters are in the same VPC anyway. That still beats setting up Thanos... I'll dig into Thanos when I find the time. Both clusters are only a dozen or so nodes right now, so the load should not be a problem.

cat <<EOF > grafana-pv.yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: grafana
  namespace: monitoring
spec:
  storageClassName: cbs
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 20Gi
EOF
kubectl apply -f grafana-pv.yaml
Modify the storage section of grafana-deployment.yaml in the manifests directory:

      volumes:
      - name: grafana-storage
        persistentVolumeClaim:
          claimName: grafana

kubectl apply -f grafana-deployment.yaml
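Then check that the PVC bound and the Grafana pod came back up on it:

kubectl get pvc grafana -n monitoring
kubectl get pods -n monitoring | grep grafana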

The rest is basically the same as Kubernetes 1.20.5 installing Prometheus-Oprator, and everything finally came up.
Note that a TKE cluster's DNS domain is cluster.local by default, so nothing needs to change there.

6. Add this cluster's Prometheus to the other cluster's Grafana

In the Grafana from Kubernetes 1.20.5 installing Prometheus-Oprator, add this cluster's Prometheus as a data source:

The test passed, OK!
Note: of course the internal network is routable anyway, so I don't really have to expose Prometheus and friends externally; I could just modify the prometheus-k8s Service instead. Let's try it!
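One simple way to do that (a sketch, not necessarily exactly what I settled on): switch the Service to NodePort so the other cluster in the same VPC can reach it via any node IP plus the assigned port:

# Expose prometheus-k8s inside the VPC without a public address.
kubectl patch svc prometheus-k8s -n monitoring -p '{"spec":{"type":"NodePort"}}'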


Verify:

kubectl get svc -n monitoring


Open Grafana → Configuration → Data sources, edit the URL of the Prometheus-1 data source, and wait for the test to pass:

Save the modified DataSource!

Open Grafana's default Kubernetes dashboard and the DataSource dropdown now shows two data sources; you can switch between them and view the corresponding dashboards.

A look at the TKE cluster's kube-system monitoring:

And my self-built kubeadm 1.21 + Cilium cluster:

Monitoring and alerting are otherwise the same as in Kubernetes 1.20.5 installing Prometheus-Oprator. Here I simply wanted Grafana to have two data sources... I'll give Thanos a try when I have time.