Background:
We have a TKE 1.20.6 cluster running online. Tencent Cloud does offer a managed, cloud-native Prometheus monitoring service, and I enabled it and gave it a try, but I never quite figured it out and the documentation is incomplete. So I decided to build a Prometheus Operator stack by hand.
The overall procedure follows my earlier post "Kubernetes 1.20.5 installing Prometheus Operator". Below I only cover what is different and what needs attention.
Process and points to watch out for:
1. Prerequisites (same steps as before)
Steps 1.1-1.4 can be kept exactly as they were; no problems there.
2. Add kubeControllerManager and kubeScheduler monitoring
I opened the Prometheus page and, just like in the previous versions, there is still no monitoring for kube-scheduler and kube-controller-manager. What I could not figure out is why there are only two kube-apiserver targets, and the two apiserver IPs starting with 169 also surprised me a bit...
First I ran netstat on a master node and found that the TKE components listen on IPv6 addresses rather than on 127.0.0.1, so I skipped editing the controller-manager and scheduler config files.
netstat -ntlp
I did not modify the kube-controller-manager.yaml or kube-scheduler.yaml config files here. While I was at it I glanced at the /etc/kubernetes/manifests directory, and... what? There is a cilium package in there too? Does TKE 1.20.6 also use Cilium?
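If you do want to confirm what address and port these components bind to before deciding whether to touch the manifests, a quick grep is enough (a sketch; the flag names are the usual upstream ones and may differ on TKE):
grep -E 'bind-address|port' /etc/kubernetes/manifests/kube-controller-manager.yaml
grep -E 'bind-address|port' /etc/kubernetes/manifests/kube-scheduler.yaml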
Deploy the Services for controller-manager and scheduler:
cat <<EOF > kube-controller-manager-scheduler.yml
apiVersion: v1
kind: Service
metadata:
  namespace: kube-system
  name: kube-controller-manager
  labels:
    app.kubernetes.io/name: kube-controller-manager
spec:
  selector:
    component: kube-controller-manager
  type: ClusterIP
  clusterIP: None
  ports:
  - name: https-metrics
    port: 10257
    targetPort: 10257
    protocol: TCP
---
apiVersion: v1
kind: Service
metadata:
  namespace: kube-system
  name: kube-scheduler
  labels:
    app.kubernetes.io/name: kube-scheduler
spec:
  selector:
    component: kube-scheduler
  type: ClusterIP
  clusterIP: None
  ports:
  - name: https-metrics
    port: 10259
    targetPort: 10259
    protocol: TCP
EOF
kubectl apply -f kube-controller-manager-scheduler.yml
kubectl get svc -n kube-system
Create the Endpoints for them:
cat <<EOF > kube-ep.yml
apiVersion: v1
kind: Endpoints
metadata:
  labels:
    k8s-app: kube-controller-manager
  name: kube-controller-manager
  namespace: kube-system
subsets:
- addresses:
  - ip: 10.0.4.25
  - ip: 10.0.4.24
  - ip: 10.0.4.38
  ports:
  - name: https-metrics
    port: 10257
    protocol: TCP
---
apiVersion: v1
kind: Endpoints
metadata:
  labels:
    k8s-app: kube-scheduler
  name: kube-scheduler
  namespace: kube-system
subsets:
- addresses:
  - ip: 10.0.4.25
  - ip: 10.0.4.24
  - ip: 10.0.4.38
  ports:
  - name: https-metrics
    port: 10259
    protocol: TCP
EOF
kubectl apply -f kube-ep.yml
kubectl get ep -n kube-system
Log in to Prometheus to verify:
Why? The controller-manager targets are up, but the kube-scheduler targets are all down?
Time to troubleshoot:
In the manifests directory (be sure to look carefully at this step: the matchLabels have changed in the newer version):
grep -A2 -B2 selector kubernetes-serviceMonitor*
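For reference, the selector the newer kube-prometheus ServiceMonitor matches on looks roughly like this (excerpt consistent with the ServiceMonitor I apply further down; the exact contents depend on your kube-prometheus release):
# kubernetes-serviceMonitorKubeScheduler.yaml (excerpt)
  selector:
    matchLabels:
      app.kubernetes.io/name: kube-scheduler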
Take a look at the labels on the pods in kube-system:
kubectl get pods -n kube-system --show-labels
I have no idea why, but the TKE core Kubernetes components carry no labels at all. I went and checked a cluster I built myself for comparison, taking the scheduler as an example:
kubectl get pods kube-scheduler-k8s-master-01 -n kube-system --show-labels
What I would really like to know is: everyone keeps shouting cloud native, so can these labels please be kept? Otherwise troubleshooting is genuinely hard for ordinary novice users. Since the labels are missing, can I just add them by hand?
kubectl label pod kube-scheduler-ap-shanghai-k8s-master-1 kube-scheduler-ap-shanghai-k8s-master-2 kube-scheduler-ap-shanghai-k8s-master-3 -n kube-system app.kubernetes.io/name=kube-scheduler
kubectl label pod kube-scheduler-ap-shanghai-k8s-master-1 kube-scheduler-ap-shanghai-k8s-master-2 kube-scheduler-ap-shanghai-k8s-master-3 -n kube-system component=kube-scheduler
kubectl label pod kube-scheduler-ap-shanghai-k8s-master-1 kube-scheduler-ap-shanghai-k8s-master-2 kube-scheduler-ap-shanghai-k8s-master-3 -n kube-system k8s-app=kube-scheduler
The Prometheus page still looks the same:
Ouch. At this point I looked back at the netstat output:
Hmm? It still has the insecure port 10251 enabled by default? Then let me switch to 10251 and try. (Although, if I remember right, upstream moved to 10259 by default around 1.17 or some version near it; I also have no idea why TKE still keeps this port open.)
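A quick way to confirm the insecure port really serves metrics before rewiring anything (the IP below is just one of the master addresses from the Endpoints above):
curl -s http://10.0.4.25:10251/metrics | head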
Regenerate the scheduler Service and Endpoints:
cat <<EOF > kube-scheduler.yaml
apiVersion: v1
kind: Service
metadata:
  namespace: kube-system
  name: kube-scheduler
  labels:
    app.kubernetes.io/name: kube-scheduler
spec:
  selector:
    app.kubernetes.io/name: kube-scheduler
  type: ClusterIP
  clusterIP: None
  ports:
  - name: http-metrics
    port: 10251
    targetPort: 10251
    protocol: TCP
---
apiVersion: v1
kind: Endpoints
metadata:
  labels:
    k8s-app: kube-scheduler
  name: kube-scheduler
  namespace: kube-system
subsets:
- addresses:
  - ip: 10.0.4.25
  - ip: 10.0.4.24
  - ip: 10.0.4.38
  ports:
  - name: http-metrics
    port: 10251
    protocol: TCP
---
EOF
kubectl apply -f kube-scheduler.yaml
Redo the scheduler's serviceMonitorKubeScheduler as well:
cat <<EOF > kubernetes-serviceMonitorKubeScheduler.yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  labels:
    app.kubernetes.io/name: kube-scheduler
  name: kube-scheduler
  namespace: monitoring
spec:
  endpoints:
  - bearerTokenFile: /var/run/secrets/kubernetes.io/serviceaccount/token
    interval: 30s
    port: http-metrics
    scheme: http
    tlsConfig:
      insecureSkipVerify: true
  jobLabel: app.kubernetes.io/name
  namespaceSelector:
    matchNames:
    - kube-system
  selector:
    matchLabels:
      app.kubernetes.io/name: kube-scheduler
EOF
kubectl apply -f kubernetes-serviceMonitorKubeScheduler.yaml
Call it a workaround... it will do for now...
3. etcd monitoring
TKE's etcd certificates have a different location and different file names from a vanilla cluster, as shown below:
root@ap-shanghai-k8s-master-1:/etc/etcd/certs# ls /etc/etcd/certs
etcd-cluster.crt  etcd-node.crt  etcd-node.key
root@ap-shanghai-k8s-master-1:/etc/etcd/certs# kubectl -n monitoring create secret generic etcd-certs --from-file=/etc/etcd/certs/etcd-node.crt --from-file=/etc/etcd/certs/etcd-node.key --from-file=/etc/etcd/certs/etcd-cluster.crt
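Optionally, you can sanity-check that these certs actually work against the etcd client port before wiring up Prometheus (the IP is one of the master addresses used earlier; adjust to your own):
curl -s --cacert /etc/etcd/certs/etcd-cluster.crt \
     --cert /etc/etcd/certs/etcd-node.crt \
     --key /etc/etcd/certs/etcd-node.key \
     https://10.0.4.25:2379/metrics | head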
Modify prometheus-prometheus.yaml to add the secret:
  secrets:
  - etcd-certs
kubectl apply -f prometheus-prometheus.yaml
kubectl exec -it prometheus-k8s-0 -n monitoring -- /bin/sh
ls /etc/prometheus/secrets/etcd-certs/
cat <<EOF > kube-ep-etcd.yml
apiVersion: v1
kind: Service
metadata:
  name: etcd-k8s
  namespace: kube-system
  labels:
    k8s-app: etcd
spec:
  type: ClusterIP
  clusterIP: None
  ports:
  - name: etcd
    port: 2379
    protocol: TCP
---
apiVersion: v1
kind: Endpoints
metadata:
  labels:
    k8s-app: etcd
  name: etcd-k8s
  namespace: kube-system
subsets:
- addresses:
  - ip: 10.0.4.25
  - ip: 10.0.4.24
  - ip: 10.0.4.38
  ports:
  - name: etcd
    port: 2379
    protocol: TCP
---
EOF
kubectl apply -f kube-ep-etcd.yml
cat <<EOF > prometheus-serviceMonitorEtcd.yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: etcd-k8s
  namespace: monitoring
  labels:
    k8s-app: etcd
spec:
  jobLabel: k8s-app
  endpoints:
  - port: etcd
    interval: 30s
    scheme: https
    tlsConfig:
      caFile: /etc/prometheus/secrets/etcd-certs/etcd-cluster.crt
      certFile: /etc/prometheus/secrets/etcd-certs/etcd-node.crt
      keyFile: /etc/prometheus/secrets/etcd-certs/etcd-node.key
      insecureSkipVerify: true
  selector:
    matchLabels:
      k8s-app: etcd
  namespaceSelector:
    matchNames:
    - kube-system
EOF
kubectl apply -f prometheus-serviceMonitorEtcd.yaml
Verify in the Prometheus web UI:
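Beyond the Targets page, a couple of quick queries in the expression browser can confirm data is actually flowing. With jobLabel: k8s-app the job name should come from the Service's k8s-app label, i.e. "etcd"; that is an assumption worth double-checking against your own Targets page:
up{job="etcd"}
etcd_server_has_leader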
With that, etcd monitoring is done as well.
4. Turn the Prometheus configuration into the production version
1. Add the auto-discovery (additional scrape) configuration
I just copied one off the internet:
cat <<EOF > prometheus-additional.yaml
- job_name: 'kubernetes-endpoints'
  kubernetes_sd_configs:
  - role: endpoints
  relabel_configs:
  - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scrape]
    action: keep
    regex: true
  - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scheme]
    action: replace
    target_label: __scheme__
    regex: (https?)
  - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_path]
    action: replace
    target_label: __metrics_path__
    regex: (.+)
  - source_labels: [__address__, __meta_kubernetes_service_annotation_prometheus_io_port]
    action: replace
    target_label: __address__
    regex: ([^:]+)(?::\d+)?;(\d+)
    replacement: $1:$2
  - action: labelmap
    regex: __meta_kubernetes_service_label_(.+)
  - source_labels: [__meta_kubernetes_namespace]
    action: replace
    target_label: kubernetes_namespace
  - source_labels: [__meta_kubernetes_service_name]
    action: replace
    target_label: kubernetes_name
  - source_labels: [__meta_kubernetes_pod_name]
    action: replace
    target_label: kubernetes_pod_name
EOF
Note: with cat <<EOF > the shell expands $1 and $2, so replacement: $1:$2 ends up as replacement: : in the generated file. Remember to fix it by hand, or quote the delimiter (cat <<'EOF') so the variables are left alone.
kubectl create secret generic additional-configs --from-file=prometheus-additional.yaml -n monitoring
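For this kubernetes-endpoints job to actually discover something, the target Service needs the prometheus.io annotations that the relabel rules above key on. A hypothetical example (my-app is made up purely for illustration):
apiVersion: v1
kind: Service
metadata:
  name: my-app            # hypothetical service, for illustration only
  namespace: default
  annotations:
    prometheus.io/scrape: "true"    # matched by the keep rule above
    prometheus.io/port: "8080"      # rewrites __address__ to <endpoint-ip>:8080
    prometheus.io/path: "/metrics"  # optional, the default path is /metrics
spec:
  selector:
    app: my-app
  ports:
  - name: http
    port: 8080
    targetPort: 8080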
2. Add storage, retention time, and the etcd secret
cat <<EOF > prometheus-prometheus.yaml
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  labels:
    app.kubernetes.io/component: prometheus
    app.kubernetes.io/name: prometheus
    app.kubernetes.io/part-of: kube-prometheus
    app.kubernetes.io/version: 2.28.1
    prometheus: k8s
  name: k8s
  namespace: monitoring
spec:
  alerting:
    alertmanagers:
    - apiVersion: v2
      name: alertmanager-main
      namespace: monitoring
      port: web
  enableFeatures: []
  externalLabels: {}
  image: quay.io/prometheus/prometheus:v2.28.1
  nodeSelector:
    kubernetes.io/os: linux
  podMetadata:
    labels:
      app.kubernetes.io/component: prometheus
      app.kubernetes.io/name: prometheus
      app.kubernetes.io/part-of: kube-prometheus
      app.kubernetes.io/version: 2.28.1
  podMonitorNamespaceSelector: {}
  podMonitorSelector: {}
  probeNamespaceSelector: {}
  probeSelector: {}
  replicas: 2
  resources:
    requests:
      memory: 400Mi
  secrets:
  - etcd-certs
  ruleNamespaceSelector: {}
  ruleSelector:
    matchLabels:
      prometheus: k8s
      role: alert-rules
  securityContext:
    fsGroup: 2000
    runAsNonRoot: true
    runAsUser: 1000
  additionalScrapeConfigs:
    name: additional-configs
    key: prometheus-additional.yaml
  serviceAccountName: prometheus-k8s
  retention: 60d
  serviceMonitorNamespaceSelector: {}
  serviceMonitorSelector: {}
  version: 2.28.1
  storage:
    volumeClaimTemplate:
      spec:
        storageClassName: cbs
        resources:
          requests:
            storage: 50Gi
EOF
kubectl apply -f prometheus-prometheus.yaml
3. The ClusterRole still needs a word
kubectl logs -f prometheus-k8s-0 prometheus -n monitoring
It is still the ClusterRole permission problem:
cat <<EOF > prometheus-clusterRole.yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  labels:
    app.kubernetes.io/component: prometheus
    app.kubernetes.io/name: prometheus
    app.kubernetes.io/part-of: kube-prometheus
    app.kubernetes.io/version: 2.28.1
  name: prometheus-k8s
rules:
- apiGroups:
  - ""
  resources:
  - nodes
  - services
  - endpoints
  - pods
  - nodes/proxy
  - nodes/metrics
  verbs:
  - get
  - list
  - watch
- apiGroups:
  - ""
  resources:
  - configmaps
  - nodes/metrics
  verbs:
  - get
- nonResourceURLs:
  - /metrics
  verbs:
  - get
EOF
kubectl apply -f prometheus-clusterRole.yaml
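A quick spot check that the extra permissions took effect for the prometheus-k8s ServiceAccount (not exhaustive, just the common ones):
kubectl auth can-i list nodes --as=system:serviceaccount:monitoring:prometheus-k8s
kubectl auth can-i list endpoints --as=system:serviceaccount:monitoring:prometheus-k8s
kubectl auth can-i get /metrics --as=system:serviceaccount:monitoring:prometheus-k8s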
5. Add persistent storage for Grafana
Note: I actually could have skipped installing Grafana here. I want to use the Grafana built in "Kubernetes 1.20.5 installing Prometheus Operator" to aggregate everything, and the two clusters are in the same VPC anyway. That is still far simpler than standing up Thanos... I will dig into Thanos when I have time. Both clusters are only a dozen or so nodes each right now, so the load should not be a problem.
cat <<EOF > grafana-pv.yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: grafana
  namespace: monitoring
spec:
  storageClassName: cbs
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 20Gi
EOF
kubectl apply -f grafana-pv.yaml
Modify the storage section of grafana-deployment.yaml in the manifests directory:
      volumes:
      - name: grafana-storage
        persistentVolumeClaim:
          claimName: grafana
kubectl apply -f grafana-deployment.yaml
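To confirm the claim is bound and the Grafana pod has rolled over (simple checks, nothing TKE-specific):
kubectl get pvc -n monitoring
kubectl get pods -n monitoring | grep grafana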
The rest is basically the same as in "Kubernetes 1.20.5 installing Prometheus Operator", and everything finally came up!
By the way, a TKE cluster's domain is cluster.local by default, so that part does not need to be changed.
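If you want to double-check the cluster domain yourself, looking at the search domains inside any pod is enough (busybox here is just a throwaway image for the check):
kubectl run dns-check --rm -it --restart=Never --image=busybox:1.28 -- cat /etc/resolv.conf
# the search line should end with svc.cluster.local cluster.local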
6. Add this cluster's Prometheus to the other cluster's Grafana
In the Grafana from "Kubernetes 1.20.5 installing Prometheus Operator", add this cluster's Prometheus as a data source:
The test passes, OK!
Note: of course, since the private network between the two clusters is already connected, I do not actually have to expose Prometheus and the other services externally. Could I just modify the prometheus-k8s Service directly? Let's try.
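One possible way to do that, assuming a NodePort reachable inside the VPC is what you want (just a sketch; a routable pod network or an intranet LoadBalancer would work too):
kubectl patch svc prometheus-k8s -n monitoring -p '{"spec":{"type":"NodePort"}}'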
Verify:
kubectl get svc -n monitoring
Open Grafana, go to Configuration -> Data Sources, edit the Prometheus-1 URL, wait for the test to pass, and save:
Save the modified data source!
Open the DataSource drop-down on Grafana's default Kubernetes dashboards and there are now two data sources; you can switch between them and view the corresponding dashboards.
A look at TKE's kube-system monitoring:
And my self-built kubeadm 1.21 + Cilium cluster:
As for alerting, it is all the same as in "Kubernetes 1.20.5 installing Prometheus Operator". Here I simply wanted Grafana to have the two data sources... I will give Thanos a try when I have time!