Ops notes: Setting up kube-prometheus (prometheus-operator) on TKE 1.20.6

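Background:

A TKE 1.20.6 cluster went live in production. Tencent Cloud offers a native Prometheus monitoring service; I enabled it and tried it out, but I never really got the hang of it, and the documentation is incomplete. So I decided to set up Prometheus Operator by hand instead.
The procedure basically follows "Installing Prometheus-Operator on Kubernetes 1.20.5". Below I only cover what differs and what needs attention.

Procedure and main points to note:

1. Repeat the prerequisite steps

Steps 1.1 to 1.4 stay essentially the same and work without problems.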

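2. Add kubeControllerManager and kubeScheduler monitoring

I opened the Prometheus page and, as with the previous few versions, there were still no targets for kube-scheduler and kube-controller-manager. What I did not understand is why kube-apiserver showed only two targets. The two apiserver addresses starting with 169 also surprised me a little...

First I ran netstat on a master node and found that on TKE these components listen on IPv6 addresses rather than on 127.0.0.1, so I skipped the step of modifying the controller-manager and scheduler configuration files.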

netstat -ntlp
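
A small sketch for narrowing the netstat output down to the ports that matter here. The helper name filter_metrics_ports is made up for illustration, and the port numbers are the upstream defaults (10251/10259 for kube-scheduler, 10252/10257 for kube-controller-manager); I am assuming TKE keeps them unchanged.

```shell
# Keep only listeners on the usual control-plane metrics ports:
#   10251 (insecure) / 10259 (secure)  kube-scheduler
#   10252 (insecure) / 10257 (secure)  kube-controller-manager
# Reads netstat output on stdin and prints the matching lines.
filter_metrics_ports() {
  grep -E ':(10251|10252|10257|10259)[[:space:]]'
}

# Usage on a master node:
#   netstat -ntlp | filter_metrics_ports
```

If a component only shows up on 127.0.0.1 in this output, Prometheus cannot reach it from a pod and the bind address would have to be changed first.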


I did not modify kube-controller-manager.yaml or kube-scheduler.yaml here. While I was at it I glanced at the /etc/kubernetes/manifests directory. What? There is a cilium package in there too? Does TKE 1.20.6 use cilium as well?

Deploy Services for controller-manager and scheduler:

cat <<EOF > kube-controller-manager-scheduler.yml
apiVersion: v1
kind: Service
metadata:
  namespace: kube-system
  name: kube-controller-manager
  labels:
    app.kubernetes.io/name: kube-controller-manager
spec:
  selector:
    component: kube-controller-manager
  type: ClusterIP
  clusterIP: None
  ports:
  - name: https-metrics
    port: 10257
    targetPort: 10257
    protocol: TCP
---
apiVersion: v1
kind: Service
metadata:
  namespace: kube-system
  name: kube-scheduler
  labels:
    app.kubernetes.io/name: kube-scheduler
spec:
  selector:
    component: kube-scheduler
  type: ClusterIP
  clusterIP: None
  ports:
  - name: https-metrics
    port: 10259
    targetPort: 10259
    protocol: TCP
EOF
kubectl apply -f kube-controller-manager-scheduler.yml
kubectl get svc -n kube-system


Create the Endpoints:

cat <<EOF > kube-ep.yml
apiVersion: v1
kind: Endpoints
metadata:
  labels:
    k8s-app: kube-controller-manager
  name: kube-controller-manager
  namespace: kube-system
subsets:
- addresses:
  - ip: 10.0.4.25
  - ip: 10.0.4.24
  - ip: 10.0.4.38
  ports:
  - name: https-metrics
    port: 10257
    protocol: TCP
---
apiVersion: v1
kind: Endpoints
metadata:
  labels:
    k8s-app: kube-scheduler
  name: kube-scheduler
  namespace: kube-system
subsets:
- addresses:
  - ip: 10.0.4.25
  - ip: 10.0.4.24
  - ip: 10.0.4.38
  ports:
  - name: https-metrics
    port: 10259
    protocol: TCP
EOF
kubectl apply -f kube-ep.yml
kubectl get ep -n kube-system
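
The same Endpoints layout is written out by hand several times in this post, so here is a rough sketch of a generator for it. The function name gen_endpoints and its argument order are my own invention; the manifest it prints mirrors the one above, with the master IPs passed in as arguments.

```shell
# Print an Endpoints manifest for a control-plane component in kube-system.
# usage: gen_endpoints NAME PORT_NAME PORT IP [IP...]
gen_endpoints() {
  name=$1; port_name=$2; port=$3
  shift 3
  cat <<EOF
apiVersion: v1
kind: Endpoints
metadata:
  labels:
    k8s-app: ${name}
  name: ${name}
  namespace: kube-system
subsets:
- addresses:
EOF
  # One address entry per remaining argument.
  for ip in "$@"; do
    printf '  - ip: %s\n' "$ip"
  done
  cat <<EOF
  ports:
  - name: ${port_name}
    port: ${port}
    protocol: TCP
EOF
}

# Usage (pipe into kubectl once the output looks right):
#   gen_endpoints kube-scheduler https-metrics 10259 10.0.4.25 10.0.4.24 10.0.4.38 | kubectl apply -f -
```

The port name has to match the port name in the Service and in the ServiceMonitor, otherwise the target will not be picked up.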

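Log in to Prometheus to verify:

Why? kube-controller-manager is up, but the kube-scheduler targets are all down?
Time to troubleshoot.
In the manifests directory (read this step carefully: the matchLabels in the newer release have changed):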

grep -A2 -B2  selector kubernetes-serviceMonitor*


Take a look at the labels on the pods in kube-system:

kubectl get pods -n kube-system --show-labels


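For some reason the core Kubernetes components on TKE carry none of these labels. For comparison I checked the cluster I built myself, taking the scheduler as an example:

kubectl get pod kube-scheduler-k8s-master-01 -n kube-system --show-labels


With everyone talking about cloud native, I really wish labels like these were kept consistent. Otherwise troubleshooting is genuinely hard for ordinary beginners. Since the labels are missing, can I just add them by hand?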

kubectl label pod  kube-scheduler-ap-shanghai-k8s-master-1 kube-scheduler-ap-shanghai-k8s-master-2 kube-scheduler-ap-shanghai-k8s-master-3 -n kube-system app.kubernetes.io/name=kube-scheduler
kubectl label pod  kube-scheduler-ap-shanghai-k8s-master-1 kube-scheduler-ap-shanghai-k8s-master-2 kube-scheduler-ap-shanghai-k8s-master-3  -n kube-system component=kube-scheduler
kubectl label pod  kube-scheduler-ap-shanghai-k8s-master-1 kube-scheduler-ap-shanghai-k8s-master-2 kube-scheduler-ap-shanghai-k8s-master-3  -n kube-system k8s-app=kube-scheduler



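The Prometheus page still looks the same:

Ouch! Then I looked at the netstat output again:

Hm? The insecure port 10251 is enabled by default? Then let me try 10251. (I have forgotten whether it was 1.17 or some other release in which upstream switched the default to 10259, and I do not know why TKE still keeps this port open.)
Regenerate the Service and Endpoints for the scheduler: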

cat <<EOF > kube-scheduler.yaml
apiVersion: v1
kind: Service
metadata:
  namespace: kube-system
  name: kube-scheduler
  labels:
    app.kubernetes.io/name: kube-scheduler
spec:
  selector:
    app.kubernetes.io/name: kube-scheduler
  type: ClusterIP
  clusterIP: None
  ports:
  - name: http-metrics
    port: 10251
    targetPort: 10251
    protocol: TCP
---
apiVersion: v1
kind: Endpoints
metadata:
  labels:
    k8s-app: kube-scheduler
  name: kube-scheduler
  namespace: kube-system
subsets:
- addresses:
  - ip: 10.0.4.25
  - ip: 10.0.4.24
  - ip: 10.0.4.38
  ports:
  - name: http-metrics
    port: 10251
    protocol: TCP
---
EOF
kubectl apply -f kube-scheduler.yaml

Redo the scheduler ServiceMonitor (serviceMonitorKubeScheduler):

cat <<EOF > kubernetes-serviceMonitorKubeScheduler.yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  labels:
    app.kubernetes.io/name: kube-scheduler
  name: kube-scheduler
  namespace: monitoring
spec:
  endpoints:
  - bearerTokenFile: /var/run/secrets/kubernetes.io/serviceaccount/token
    interval: 30s
    port: http-metrics
    scheme: http
    tlsConfig:
      insecureSkipVerify: true
  jobLabel: app.kubernetes.io/name
  namespaceSelector:
    matchNames:
    - kube-system
  selector:
    matchLabels:
      app.kubernetes.io/name: kube-scheduler
EOF
kubectl apply -f kubernetes-serviceMonitorKubeScheduler.yaml


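Call it a roundabout workaround... good enough to get by for now.

3. etcd monitoring

On TKE the certificate location and file names differ from those of a stock cluster, as follows: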

root@ap-shanghai-k8s-master-1:/etc/etcd/certs# ls /etc/etcd/certs
etcd-cluster.crt  etcd-node.crt  etcd-node.key
root@ap-shanghai-k8s-master-1:/etc/etcd/certs# kubectl -n monitoring create secret generic etcd-certs --from-file=/etc/etcd/certs/etcd-node.crt --from-file=/etc/etcd/certs/etcd-node.key --from-file=/etc/etcd/certs/etcd-cluster.crt
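
Because the file names differ from a stock cluster, a quick pre-check before creating the secret can save a confusing error later. This is only a sketch: check_etcd_certs is a hypothetical helper, and the three file names are the ones from the listing above.

```shell
# Return non-zero (and name the first missing file) if any of the three
# etcd certificate files is absent from the given directory.
check_etcd_certs() {
  dir=$1
  for f in etcd-cluster.crt etcd-node.crt etcd-node.key; do
    [ -f "$dir/$f" ] || { echo "missing: $dir/$f" >&2; return 1; }
  done
}

# Usage on a master node:
#   check_etcd_certs /etc/etcd/certs && echo "all certificate files present"
```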


Edit prometheus-prometheus.yaml and add the secret under spec:

secrets:
- etcd-certs

kubectl apply -f prometheus-prometheus.yaml
kubectl exec -it prometheus-k8s-0 -n monitoring -c prometheus -- /bin/sh
ls /etc/prometheus/secrets/etcd-certs/

cat <<EOF > kube-ep-etcd.yml
apiVersion: v1
kind: Service
metadata:
  name: etcd-k8s
  namespace: kube-system
  labels:
    k8s-app: etcd
spec:
  type: ClusterIP
  clusterIP: None
  ports:
  - name: etcd
    port: 2379
    protocol: TCP

---
apiVersion: v1
kind: Endpoints
metadata:
  labels:
    k8s-app: etcd
  name: etcd-k8s
  namespace: kube-system
subsets:
- addresses:
  - ip: 10.0.4.25
  - ip: 10.0.4.24
  - ip: 10.0.4.38
  ports:
  - name: etcd
    port: 2379
    protocol: TCP
---
EOF
kubectl apply -f kube-ep-etcd.yml
cat <<EOF > prometheus-serviceMonitorEtcd.yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: etcd-k8s
  namespace: monitoring
  labels:
    k8s-app: etcd
spec:
  jobLabel: k8s-app
  endpoints:
  - port: etcd
    interval: 30s
    scheme: https
    tlsConfig:
      caFile: /etc/prometheus/secrets/etcd-certs/etcd-cluster.crt
      certFile: /etc/prometheus/secrets/etcd-certs/etcd-node.crt
      keyFile: /etc/prometheus/secrets/etcd-certs/etcd-node.key
      insecureSkipVerify: true
  selector:
    matchLabels:
      k8s-app: etcd
  namespaceSelector:
    matchNames:
    - kube-system
EOF
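kubectl apply -f prometheus-serviceMonitorEtcd.yaml


Verify in the Prometheus web UI:

etcd monitoring is now done as well.

4. Change the Prometheus configuration to production settings

1. Add service discovery configuration

I copied one more or less at random from the internet: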

cat <<EOF > prometheus-additional.yaml
- job_name: 'kubernetes-endpoints'
  kubernetes_sd_configs:
  - role: endpoints
  relabel_configs:
  - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scrape]
    action: keep
    regex: true
  - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scheme]
    action: replace
    target_label: __scheme__
    regex: (https?)
  - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_path]
    action: replace
    target_label: __metrics_path__
    regex: (.+)
  - source_labels: [__address__, __meta_kubernetes_service_annotation_prometheus_io_port]
    action: replace
    target_label: __address__
    regex: ([^:]+)(?::\d+)?;(\d+)
    replacement: $1:$2
  - action: labelmap
    regex: __meta_kubernetes_service_label_(.+)
  - source_labels: [__meta_kubernetes_namespace]
    action: replace
    target_label: kubernetes_namespace
  - source_labels: [__meta_kubernetes_service_name]
    action: replace
    target_label: kubernetes_name
  - source_labels: [__meta_kubernetes_pod_name]
    action: replace
    target_label: kubernetes_pod_name
EOF

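Note: with an unquoted cat <<EOF, the shell expands $1 and $2 (which are empty), so replacement: $1:$2 ends up in the file as replacement: : and has to be corrected by hand afterwards. Quoting the delimiter (cat <<'EOF') avoids the expansion altogether.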
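
The $1:$2 problem comes from how the shell treats heredocs, not from Prometheus. A minimal demonstration of the difference between an unquoted and a quoted delimiter (the temporary directory and file names are only for illustration):

```shell
workdir=$(mktemp -d)

# Unquoted delimiter: $1 and $2 are expanded by the shell before writing.
# In an interactive shell they are normally empty, which leaves "replacement: :".
cat <<EOF > "$workdir/unquoted.yaml"
    replacement: $1:$2
EOF

# Quoted delimiter: no expansion, the line is written exactly as typed.
cat <<'EOF' > "$workdir/quoted.yaml"
    replacement: $1:$2
EOF

cat "$workdir/unquoted.yaml"
cat "$workdir/quoted.yaml"   # prints "    replacement: $1:$2"
```

The other manifests in this post contain no $ characters, so the unquoted form is harmless there.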

kubectl create secret generic additional-configs --from-file=prometheus-additional.yaml -n monitoring

2. Add storage, retention time and the etcd secret

cat <<EOF > prometheus-prometheus.yaml
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  labels:
    app.kubernetes.io/component: prometheus
    app.kubernetes.io/name: prometheus
    app.kubernetes.io/part-of: kube-prometheus
    app.kubernetes.io/version: 2.28.1
    prometheus: k8s
  name: k8s
  namespace: monitoring
spec:
  alerting:
    alertmanagers:
    - apiVersion: v2
      name: alertmanager-main
      namespace: monitoring
      port: web
  enableFeatures: []
  externalLabels: {}
  image: quay.io/prometheus/prometheus:v2.28.1
  nodeSelector:
    kubernetes.io/os: linux
  podMetadata:
    labels:
      app.kubernetes.io/component: prometheus
      app.kubernetes.io/name: prometheus
      app.kubernetes.io/part-of: kube-prometheus
      app.kubernetes.io/version: 2.28.1
  podMonitorNamespaceSelector: {}
  podMonitorSelector: {}
  probeNamespaceSelector: {}
  probeSelector: {}
  replicas: 2
  resources:
    requests:
      memory: 400Mi
  secrets:
  - etcd-certs
  ruleNamespaceSelector: {}
  ruleSelector:
    matchLabels:
      prometheus: k8s
      role: alert-rules
  securityContext:
    fsGroup: 2000
    runAsNonRoot: true
    runAsUser: 1000
  additionalScrapeConfigs:
     name: additional-configs
     key: prometheus-additional.yaml
  serviceAccountName: prometheus-k8s
  retention: 60d
  serviceMonitorNamespaceSelector: {}
  serviceMonitorSelector: {}
  version: 2.28.1
  storage:
    volumeClaimTemplate:
      spec:
        storageClassName: cbs
        resources:
          requests:
            storage: 50Gi
EOF
kubectl apply -f prometheus-prometheus.yaml

3. A word on the ClusterRole

kubectl logs -f prometheus-k8s-0 prometheus -n monitoring
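
To see at a glance which permissions are missing, the log can be reduced to the distinct "cannot ... resource ..." fragments that the Kubernetes API server puts into its forbidden errors. A rough sketch; missing_rbac is a made-up helper name, and it assumes the errors appear in the Prometheus container log in that standard wording.

```shell
# Read log lines on stdin and print each distinct denied verb/resource pair,
# e.g. a line such as: cannot list resource \"endpoints\"
missing_rbac() {
  grep -oE 'cannot [a-z]+ resource [^ ]+' | sort -u
}

# Usage:
#   kubectl logs prometheus-k8s-0 prometheus -n monitoring | missing_rbac
```

Each resource that shows up here needs to be covered by the verbs in the ClusterRole bound to the prometheus-k8s service account.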


As before, it is a ClusterRole permission problem. Write the ClusterRole to its own file so that it does not overwrite the Prometheus manifest from the previous step:

cat <<EOF > prometheus-clusterRole.yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  labels:
    app.kubernetes.io/component: prometheus
    app.kubernetes.io/name: prometheus
    app.kubernetes.io/part-of: kube-prometheus
    app.kubernetes.io/version: 2.28.1
  name: prometheus-k8s
rules:
- apiGroups:
  - ""
  resources:
  - nodes
  - services
  - endpoints
  - pods
  - nodes/proxy
  - nodes/metrics
  verbs:
  - get
  - list
  - watch
- apiGroups:
  - ""
  resources:
  - configmaps
  verbs:
  - get
- nonResourceURLs:
  - /metrics
  verbs:
  - get
EOF
kubectl apply -f prometheus-clusterRole.yaml

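5. Add persistent storage for Grafana

Note: I do not actually need to install Grafana here at all. I want to use the Grafana set up in "Installing Prometheus-Operator on Kubernetes 1.20.5" as the aggregation point, and the two clusters sit in the same VPC. That is much simpler than setting up Thanos... I will look into Thanos when I have time. Both clusters currently have only a dozen or so nodes, so the load should not be a problem.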

cat <<EOF > grafana-pv.yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: grafana
  namespace: monitoring
spec:
  storageClassName: cbs
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 20Gi
EOF
kubectl apply -f grafana-pv.yaml

Edit the storage section of grafana-deployment.yaml in the manifests directory:
      volumes:
      - name: grafana-storage
        persistentVolumeClaim:
          claimName: grafana

kubectl apply -f grafana-deployment.yaml

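The rest is basically the same as in "Installing Prometheus-Operator on Kubernetes 1.20.5", and it is finally up and running!
Also, the default cluster domain on TKE is cluster.local, so nothing needs to be changed there.

6. Add this cluster's Prometheus to the other cluster's Grafana

In the Grafana from "Installing Prometheus-Operator on Kubernetes 1.20.5", add this cluster's Prometheus as a data source:

The test passes, OK!
Note: since the internal network is reachable, I do not have to expose Prometheus and the other services publicly. Could I just modify the prometheus-k8s Service directly? Let's try.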


Verify:

kubectl get svc -n monitoring


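Open Grafana > Configuration > Data sources, change the URL in the Prometheus-1 configuration, and wait for the validation to pass:

Save the modified data source.

Open one of Grafana's default Kubernetes dashboards: the data source selector now lists two data sources, and you can switch between them to view the corresponding charts.

kube-system monitoring for the TKE cluster:

My personally built kubeadm 1.21 + cilium cluster:

Alerting is the same as in "Installing Prometheus-Operator on Kubernetes 1.20.5". Here I simply wanted Grafana to have two data sources... I will try Thanos when I have time.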
