Background
The production Kubernetes cluster is on version 1.16 with the Prometheus Operator 0.4 branch. The compatibility matrix between kube-prometheus branches and Kubernetes versions is shown in the figure below; see https://github.com/prometheus-operator/kube-prometheus.
Note: Prometheus Operator? kube-prometheus? kube-prometheus is simply an operator-style packaging of Prometheus; what used to ship as prometheus-operator was later renamed to kube-prometheus.
For the deployment process you can refer to the notes by 超级小豆丁: http://www.mydlq.club/article/10/. The Prometheus architecture diagram shows up in pretty much every write-up on the subject, so I won't repeat it here.
For now, just do a quick deployment of Prometheus Operator (or rather kube-prometheus), get WeChat alerting integrated, and study the rest gradually in the production environment.
The basic process: add persistent storage to the Prometheus Operator stack, add WeChat alerting, and expose the web UIs through the external traefik proxy.
1. Setting up the Prometheus environment
1. Clone the kube-prometheus repository
git clone https://github.com/prometheus-operator/kube-prometheus.git
Because of network issues the clone frequently fails, so just download the zip archive instead. According to the support matrix, Kubernetes 1.20 can use the kube-prometheus release-0.6, release-0.7, or HEAD branches; I took the lazy option and went with HEAD.
Make a note of the tag/commit you deploy, so that later changes are easy to make and version upgrades can keep up.
Upload the zip archive and extract it:
unzip kube-prometheus-main.zip
2. Run through the quickstart
cd kube-prometheus-main/
kubectl create -f manifests/setup
until kubectl get servicemonitors --all-namespaces ; do date; sleep 1; echo ""; done
kubectl create -f manifests/
kubectl get pods -n monitoring
3. ImagePullBackOff
Because of network issues some images cannot be pulled. The fix is to pull them on a server outside the firewall, re-tag them, push them to Harbor, and change the image references in the YAML to the corresponding tags in the private registry. (My private registry is Tencent Cloud's; cross-region pushes no longer seem to be allowed on the personal tier, so I used docker save to export the images instead.):
kubectl describe pods kube-state-metrics-56f988c7b6-qxqjn -n monitoring
1. Pull the images on an overseas server and save them as tar archives to copy down locally.
docker pull k8s.gcr.io/kube-state-metrics/kube-state-metrics:v2.0.0-rc.0
docker save k8s.gcr.io/kube-state-metrics/kube-state-metrics:v2.0.0-rc.0 -o kube-state-metrics.tar
2. Import the image with ctr
ctr -n k8s.io i import kube-state-metrics.tar
But this only imports the image onto a single worker node. Kubernetes is supposed to be highly available: what if the pod gets rescheduled onto another node? Do I add node affinity? What if that node dies? Import the image on every node? What about nodes that join later? Better to just push it to an image registry properly.
The proper flow should look something like this:
crictl images
ctr image tag k8s.gcr.io/kube-state-metrics/kube-state-metrics:v2.0.0-rc.0 ccr.ccs.tencentyun.com/k8s_containers/kube-state-metrics:v2.0.0-rc.0
But why does it report not found? Maybe a tag-format problem... Anyway, the idea is to push it to the registry; the exact commands are covered in https://blog.csdn.net/tongzidane/article/details/114587138 and https://blog.csdn.net/liumiaocn/article/details/103320426/
(Pushing to my registry still had permission problems (pulls work, but... I confused myself), so I gave up and just imported the image directly.)
Either way, the goal is simply to get kube-state-metrics-XXXX running. I should set aside some time to properly study the ctr and crictl commands; they still confuse me a bit.
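For what it's worth, my guess at the not found error is a namespace mismatch: the image was imported into the k8s.io namespace, while ctr image tag without -n operates on the default namespace. A small sketch that stays in one namespace (registry path reused from above; the credentials are placeholders):
ctr -n k8s.io image tag k8s.gcr.io/kube-state-metrics/kube-state-metrics:v2.0.0-rc.0 ccr.ccs.tencentyun.com/k8s_containers/kube-state-metrics:v2.0.0-rc.0
ctr -n k8s.io image push --user <username>:<password> ccr.ccs.tencentyun.com/k8s_containers/kube-state-metrics:v2.0.0-rc.0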
3. Verify that all services start up normally
kubectl get pod -n monitoring
kubectl get svc -n monitoring
4. Expose the applications through traefik
Note: see the earlier post on installing traefik for Kubernetes 1.20.5 on Tencent Cloud: https://www.yuque.com/duiniwukenaihe/ehb02i/odflm7#WT4ab. I've grown used to the IngressRoute approach, so I'm sticking with it rather than using Ingress or the API gateway style.
cat monitoring.com.yaml
apiVersion: traefik.containo.us/v1alpha1
kind: IngressRoute
metadata:
  namespace: monitoring
  name: alertmanager-main-http
spec:
  entryPoints:
    - web
  routes:
    - match: Host(`alertmanager.saynaihe.com`)
      kind: Rule
      services:
        - name: alertmanager-main
          port: 9093
---
apiVersion: traefik.containo.us/v1alpha1
kind: IngressRoute
metadata:
  namespace: monitoring
  name: grafana-http
spec:
  entryPoints:
    - web
  routes:
    - match: Host(`monitoring.saynaihe.com`)
      kind: Rule
      services:
        - name: grafana
          port: 3000
---
apiVersion: traefik.containo.us/v1alpha1
kind: IngressRoute
metadata:
  namespace: monitoring
  name: prometheus
spec:
  entryPoints:
    - web
  routes:
    - match: Host(`prometheus.saynaihe.com`)
      kind: Rule
      services:
        - name: prometheus-k8s
          port: 9090
kubectl apply -f monitoring.com.yaml
Verify that the traefik routes work:
Change the password.
This is just a quick demo for now; it will be revisited later.
For demonstration purposes only — later on, at the very least the alertmanager and Prometheus web UIs should be protected with basic auth...
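As a rough sketch of what that could look like with a traefik basicAuth Middleware (the names and credentials below are placeholders of mine, not part of the original setup; the users value is the base64 of an htpasswd entry, e.g. htpasswd -nb admin 'yourpassword' | base64 -w0):
cat <<EOF > prometheus-basic-auth.yaml
apiVersion: v1
kind: Secret
metadata:
  name: prometheus-basic-auth
  namespace: monitoring
data:
  users: <base64-encoded htpasswd output>
---
apiVersion: traefik.containo.us/v1alpha1
kind: Middleware
metadata:
  name: prometheus-basic-auth
  namespace: monitoring
spec:
  basicAuth:
    secret: prometheus-basic-auth
EOF
kubectl apply -f prometheus-basic-auth.yaml
Then reference the middleware in the IngressRoute route, for example:
  routes:
    - match: Host(`prometheus.saynaihe.com`)
      kind: Rule
      middlewares:
        - name: prometheus-basic-auth
      services:
        - name: prometheus-k8s
          port: 9090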
5. Add monitoring for kube-controller-manager and kube-scheduler
The https://prometheus.saynaihe.com/targets page shows that, just like in previous versions, there is still no monitoring for kube-scheduler and kube-controller-manager.
Edit kube-controller-manager.yaml and kube-scheduler.yaml under /etc/kubernetes/manifests/ and change --bind-address=127.0.0.1 to --bind-address=0.0.0.0.
Since these are static pod manifests, the controller-manager and scheduler restart automatically after the change; wait for them to come back up and verify.
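A quick way to make that change on each master node (a sketch; worth backing up the manifests first):
sed -i 's/--bind-address=127.0.0.1/--bind-address=0.0.0.0/' /etc/kubernetes/manifests/kube-controller-manager.yaml
sed -i 's/--bind-address=127.0.0.1/--bind-address=0.0.0.0/' /etc/kubernetes/manifests/kube-scheduler.yaml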
In the manifests directory (pay close attention here: the matchLabels have changed in the new version):
grep -A2 -B2 selector kubernetes-serviceMonitor*
cat <<EOF > kube-controller-manager-scheduler.yml
apiVersion: v1
kind: Service
metadata:
  namespace: kube-system
  name: kube-controller-manager
  labels:
    app.kubernetes.io/name: kube-controller-manager
spec:
  selector:
    component: kube-controller-manager
  type: ClusterIP
  clusterIP: None
  ports:
    - name: https-metrics
      port: 10257
      targetPort: 10257
      protocol: TCP
---
apiVersion: v1
kind: Service
metadata:
  namespace: kube-system
  name: kube-scheduler
  labels:
    app.kubernetes.io/name: kube-scheduler
spec:
  selector:
    component: kube-scheduler
  type: ClusterIP
  clusterIP: None
  ports:
    - name: https-metrics
      port: 10259
      targetPort: 10259
      protocol: TCP
EOF
kubectl apply -f kube-controller-manager-scheduler.yml
cat <<EOF > kube-ep.yml
apiVersion: v1
kind: Endpoints
metadata:
  labels:
    k8s-app: kube-controller-manager
  name: kube-controller-manager
  namespace: kube-system
subsets:
  - addresses:
      - ip: 10.3.2.5
      - ip: 10.3.2.13
      - ip: 10.3.2.16
    ports:
      - name: https-metrics
        port: 10257
        protocol: TCP
---
apiVersion: v1
kind: Endpoints
metadata:
  labels:
    k8s-app: kube-scheduler
  name: kube-scheduler
  namespace: kube-system
subsets:
  - addresses:
      - ip: 10.3.2.5
      - ip: 10.3.2.13
      - ip: 10.3.2.16
    ports:
      - name: https-metrics
        port: 10259
        protocol: TCP
EOF
kubectl apply -f kube-ep.yml
Log in to https://prometheus.saynaihe.com/targets to verify:
6. Monitoring etcd
kubectl -n monitoring create secret generic etcd-certs --from-file=/etc/kubernetes/pki/etcd/healthcheck-client.crt --from-file=/etc/kubernetes/pki/etcd/healthcheck-client.key --from-file=/etc/kubernetes/pki/etcd/ca.crt
kubectl edit prometheus k8s -n monitoring
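The edit adds the etcd-certs secret to the Prometheus spec so that the operator mounts it into the pods under /etc/prometheus/secrets/etcd-certs/; the relevant fragment (it also appears in the full prometheus-prometheus.yaml further down) is:
spec:
  secrets:
    - etcd-certs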
Verify that Prometheus mounts the certificates correctly:
[root@sh-master-02 yaml]# kubectl exec -it prometheus-k8s-0 /bin/sh -n monitoring
kubectl exec [POD] [COMMAND] is DEPRECATED and will be removed in a future version. Use kubectl exec [POD] -- [COMMAND] instead.
Defaulting container name to prometheus.
Use 'kubectl describe pod/prometheus-k8s-0 -n monitoring' to see all of the containers in this pod.
/prometheus $ ls /etc/prometheus/secrets/etcd-certs/
ca.crt healthcheck-client.crt healthcheck-client.key
cat <<EOF > kube-ep-etcd.yml
apiVersion: v1
kind: Service
metadata:
  name: etcd-k8s
  namespace: kube-system
  labels:
    k8s-app: etcd
spec:
  type: ClusterIP
  clusterIP: None
  ports:
    - name: etcd
      port: 2379
      protocol: TCP
---
apiVersion: v1
kind: Endpoints
metadata:
  labels:
    k8s-app: etcd
  name: etcd-k8s
  namespace: kube-system
subsets:
  - addresses:
      - ip: 10.3.2.5
      - ip: 10.3.2.13
      - ip: 10.3.2.16
    ports:
      - name: etcd
        port: 2379
        protocol: TCP
---
EOF
kubectl apply -f kube-ep-etcd.yml
cat <<EOF > prometheus-serviceMonitorEtcd.yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: etcd-k8s
  namespace: monitoring
  labels:
    k8s-app: etcd
spec:
  jobLabel: k8s-app
  endpoints:
    - port: etcd
      interval: 30s
      scheme: https
      tlsConfig:
        caFile: /etc/prometheus/secrets/etcd-certs/ca.crt
        certFile: /etc/prometheus/secrets/etcd-certs/healthcheck-client.crt
        keyFile: /etc/prometheus/secrets/etcd-certs/healthcheck-client.key
        insecureSkipVerify: true
  selector:
    matchLabels:
      k8s-app: etcd
  namespaceSelector:
    matchNames:
      - kube-system
EOF
kubectl apply -f prometheus-serviceMonitorEtcd.yaml
7. Finalize the Prometheus configuration
1. Add auto-discovery configuration
Grabbed a fairly standard example off the internet (note the quoted heredoc delimiter below, so the shell does not expand the $1:$2 in the relabel rule):
cat <<'EOF' > prometheus-additional.yaml
- job_name: 'kubernetes-endpoints'
  kubernetes_sd_configs:
    - role: endpoints
  relabel_configs:
    - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scrape]
      action: keep
      regex: true
    - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scheme]
      action: replace
      target_label: __scheme__
      regex: (https?)
    - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_path]
      action: replace
      target_label: __metrics_path__
      regex: (.+)
    - source_labels: [__address__, __meta_kubernetes_service_annotation_prometheus_io_port]
      action: replace
      target_label: __address__
      regex: ([^:]+)(?::\d+)?;(\d+)
      replacement: $1:$2
    - action: labelmap
      regex: __meta_kubernetes_service_label_(.+)
    - source_labels: [__meta_kubernetes_namespace]
      action: replace
      target_label: kubernetes_namespace
    - source_labels: [__meta_kubernetes_service_name]
      action: replace
      target_label: kubernetes_name
    - source_labels: [__meta_kubernetes_pod_name]
      action: replace
      target_label: kubernetes_pod_name
EOF
kubectl create secret generic additional-configs --from-file=prometheus-additional.yaml -n monitoring
2. Add storage, retention time, and the etcd secret
cat <<EOF > prometheus-prometheus.yaml
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  labels:
    app.kubernetes.io/component: prometheus
    app.kubernetes.io/name: prometheus
    app.kubernetes.io/part-of: kube-prometheus
    app.kubernetes.io/version: 2.25.0
    prometheus: k8s
  name: k8s
  namespace: monitoring
spec:
  alerting:
    alertmanagers:
      - apiVersion: v2
        name: alertmanager-main
        namespace: monitoring
        port: web
  externalLabels: {}
  image: quay.io/prometheus/prometheus:v2.25.0
  nodeSelector:
    kubernetes.io/os: linux
  podMetadata:
    labels:
      app.kubernetes.io/component: prometheus
      app.kubernetes.io/name: prometheus
      app.kubernetes.io/part-of: kube-prometheus
      app.kubernetes.io/version: 2.25.0
  podMonitorNamespaceSelector: {}
  podMonitorSelector: {}
  probeNamespaceSelector: {}
  probeSelector: {}
  replicas: 2
  resources:
    requests:
      memory: 400Mi
  ruleSelector:
    matchLabels:
      prometheus: k8s
      role: alert-rules
  secrets:
    - etcd-certs
  securityContext:
    fsGroup: 2000
    runAsNonRoot: true
    runAsUser: 1000
  additionalScrapeConfigs:
    name: additional-configs
    key: prometheus-additional.yaml
  serviceAccountName: prometheus-k8s
  retention: 60d
  serviceMonitorNamespaceSelector: {}
  serviceMonitorSelector: {}
  version: 2.25.0
  storage:
    volumeClaimTemplate:
      spec:
        storageClassName: cbs-csi
        resources:
          requests:
            storage: 50Gi
EOF
kubectl apply -f prometheus-prometheus.yaml
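A quick, rough way to confirm the operator picked the changes up (retention, the additional scrape config, and the new volume claims; names follow the defaults used above):
kubectl -n monitoring get prometheus k8s -o jsonpath='{.spec.retention}{"\n"}'
kubectl -n monitoring get pvc | grep prometheus-k8s
kubectl -n monitoring get pods | grep prometheus-k8s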
8. Add persistent storage to Grafana
Create a PVC for Grafana:
cat <<EOF > grafana-pv.yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: grafana
  namespace: monitoring
spec:
  storageClassName: cbs-csi
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 20Gi
EOF
kubectl apply -f grafana-pv.yaml
Modify the storage in grafana-deployment.yaml under the manifests directory.
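In grafana-deployment.yaml the grafana-storage volume is an emptyDir by default; a sketch of switching it to the PVC created above (assuming the claim is named grafana as above):
      volumes:
        - name: grafana-storage
          persistentVolumeClaim:
            claimName: grafana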
9. Add Grafana dashboards
Import the etcd and traefik dashboards (dashboard IDs 10906 and 3070). You'll find the traefik dashboard complains: Panel plugin not found: grafana-piechart-panel.
Fix: rebuild the Grafana image and install the missing plugin with /usr/share/grafana/bin/grafana-cli plugins install grafana-piechart-panel.
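A minimal sketch of such a rebuild (the base tag and registry path are placeholders — match the tag to whatever grafana-deployment.yaml currently uses, then point the deployment's image at the new tag):
cat <<EOF > Dockerfile
FROM grafana/grafana:7.5.4
RUN grafana-cli plugins install grafana-piechart-panel
EOF
docker build -t ccr.ccs.tencentyun.com/XXXXX/grafana:7.5.4-piechart .
docker push ccr.ccs.tencentyun.com/XXXXX/grafana:7.5.4-piechart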
10. WeChat alerting
Fill the corresponding WeChat Work credentials into alertmanager.yaml.
1. Configure alertmanager.yaml
cat <<EOF > alertmanager.yaml
global:
  resolve_timeout: 2m
  wechat_api_url: 'https://qyapi.weixin.qq.com/cgi-bin/'
route:
  group_by: ['alert']
  group_wait: 10s
  group_interval: 1m
  repeat_interval: 1h
  receiver: wechat
receivers:
  - name: 'wechat'
    wechat_configs:
      - api_secret: 'XXXXXXXXXX'
        send_resolved: true
        to_user: '@all'
        to_party: 'XXXXXX'
        agent_id: 'XXXXXXXX'
        corp_id: 'XXXXXXXX'
templates:
  - '/etc/config/alert/wechat.tmpl'
inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'production', 'instance']
EOF
2. Customize the alert template to taste; there are plenty of examples online.
cat <<'EOF' > wechat.tmpl
{{define "wechat.default.message"}}
{{- if gt (len .Alerts.Firing) 0 -}}
{{- range $index, $alert := .Alerts -}}
{{- if eq $index 0}}
========== Alert firing ==========
Alert name: {{$alert.Labels.alertname}}
Severity: {{$alert.Labels.severity}}
Details: {{$alert.Annotations.message}}{{$alert.Annotations.description}};{{$alert.Annotations.summary}}
Started at: {{($alert.StartsAt.Add 28800e9).Format "2006-01-02 15:04:05" }}
{{- if gt (len $alert.Labels.instance) 0 }}
Instance: {{$alert.Labels.instance}}
{{- end}}
{{- if gt (len $alert.Labels.namespace) 0 }}
Namespace: {{$alert.Labels.namespace}}
{{- end}}
{{- if gt (len $alert.Labels.node) 0 }}
Node: {{$alert.Labels.node}}
{{- end}}
{{- if gt (len $alert.Labels.pod) 0 }}
Pod: {{$alert.Labels.pod}}
{{- end}}
============END============
{{- end}}
{{- end}}
{{- end}}
{{- if gt (len .Alerts.Resolved) 0 -}}
{{- range $index, $alert := .Alerts -}}
{{- if eq $index 0}}
========== Alert resolved ==========
Alert name: {{$alert.Labels.alertname}}
Severity: {{$alert.Labels.severity}}
Details: {{$alert.Annotations.message}}{{$alert.Annotations.description}};{{$alert.Annotations.summary}}
Started at: {{($alert.StartsAt.Add 28800e9).Format "2006-01-02 15:04:05" }}
Resolved at: {{($alert.EndsAt.Add 28800e9).Format "2006-01-02 15:04:05" }}
{{- if gt (len $alert.Labels.instance) 0 }}
Instance: {{$alert.Labels.instance}}
{{- end}}
{{- if gt (len $alert.Labels.namespace) 0 }}
Namespace: {{$alert.Labels.namespace}}
{{- end}}
{{- if gt (len $alert.Labels.node) 0 }}
Node: {{$alert.Labels.node}}
{{- end}}
{{- if gt (len $alert.Labels.pod) 0 }}
Pod: {{$alert.Labels.pod}}
{{- end}}
============END============
{{- end}}
{{- end}}
{{- end}}
{{- end}}
EOF
3. Deploy the secret
kubectl delete secret alertmanager-main -n monitoring
kubectl create secret generic alertmanager-main --from-file=alertmanager.yaml --from-file=wechat.tmpl -n monitoring
4. Verify
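One quick way to check the WeChat channel end to end is to fire a hand-made alert at Alertmanager (the URL assumes the traefik route configured earlier; the labels are arbitrary test values):
curl -XPOST http://alertmanager.saynaihe.com/api/v2/alerts -H 'Content-Type: application/json' -d '[{"labels":{"alertname":"TestAlert","severity":"warning","instance":"test"},"annotations":{"summary":"Test alert to verify the WeChat receiver"}}]'
A few moments later the test message should arrive in WeChat Work, and a resolved message follows once the alert expires.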
11. Bonus
I happened to want to try out Kubernetes HPA as well,
[root@sh-master-02 yaml]# kubectl top pods -n qa
W0330 16:00:54.657335 2622645 top_pod.go:265] Metrics not available for pod qa/dataloader-comment-5d975d9d57-p22w9, age: 2h3m13.657327145s
error: Metrics not available for pod qa/dataloader-comment-5d975d9d57-p22w9, age: 2h3m13.657327145s
Huh? Doesn't the Prometheus Operator stack already provide metrics? What's going on?
kubectl logs -f prometheus-adapter-c96488cdd-vfm7h -n monitoring
As the log below shows... when I installed Kubernetes I changed the cluster's dnsDomain but never updated this config, and that's the problem.
Edit prometheus-url in prometheus-adapter-deployment.yaml under the manifests directory.
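The argument in question in the adapter container looks roughly like this (the default assumes cluster.local; swap in your own dnsDomain — the exact default URL may differ slightly between kube-prometheus versions):
      - --prometheus-url=http://prometheus-k8s.monitoring.svc.cluster.local:9090/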
After that, kubectl top nodes works.
12. A quick look at HPA
Following https://blog.csdn.net/weixin_38320674/article/details/105460033; since the environment already has metrics, start from step 7 of that article.
1. Build the image and push it to the registry
docker build -t ccr.ccs.tencentyun.com/XXXXX/test1:0.1 .
docker push ccr.ccs.tencentyun.com/XXXXX/test1:0.1
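The Dockerfile itself isn't shown here; for reference, the image in the upstream Kubernetes HPA walkthrough (which this test image mimics) is built roughly like this before running the docker build above:
cat <<'EOF' > index.php
<?php
  $x = 0.0001;
  for ($i = 0; $i <= 1000000; $i++) {
    $x += sqrt($x);
  }
  echo "OK!";
?>
EOF
cat <<'EOF' > Dockerfile
FROM php:5-apache
COPY index.php /var/www/html/index.php
RUN chmod a+rx index.php
EOF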
2. Deploy a php-apache service via a Deployment
cat php-apache.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: php-apache
spec:
  selector:
    matchLabels:
      run: php-apache
  replicas: 1
  template:
    metadata:
      labels:
        run: php-apache
    spec:
      containers:
        - name: php-apache
          image: ccr.ccs.tencentyun.com/XXXXX/test1:0.1
          ports:
            - containerPort: 80
          resources:
            limits:
              cpu: 200m
            requests:
              cpu: 100m
---
apiVersion: v1
kind: Service
metadata:
  name: php-apache
  labels:
    run: php-apache
spec:
  ports:
    - port: 80
  selector:
    run: php-apache
kubectl apply -f php-apache.yaml
3. Create the HPA
kubectl autoscale deployment php-apache --cpu-percent=50 --min=1 --max=10
The flags mean the following:
kubectl autoscale deployment php-apache (php-apache is the name of the Deployment) --cpu-percent=50 (target CPU utilization of 50%) --min=1 (at least one pod) --max=10 (at most ten pods)
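After creating it, you can check the HPA and keep an eye on it while the load test below runs:
kubectl get hpa php-apache
kubectl get hpa php-apache -w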
4. Load-test the php-apache service (CPU only)
Start a container and send an infinite query loop to the php-apache service (open a new terminal window on the k8s master node):
kubectl run v1 -it --image=busybox /bin/sh
Once inside the container, run:
while true; do wget -q -O- http://php-apache.default; done
This only stresses the CPU; it's a simple demo, and I won't go into the rest here.
13. Other pitfalls
At some point I accidentally deleted the PV and PVC... thought my storageclass was broken, so I redeployed. I assumed re-applying prometheus-prometheus.yaml would be enough, but the Prometheus pods never appeared. After going through the logs it turned out the secret from step 7.1 was missing; re-running the command below fixed it. I don't remember where exactly I deleted that secret... noting it here:
kubectl create secret generic additional-configs --from-file=prometheus-additional.yaml -n monitoring