Background
The production Kubernetes cluster runs 1.16 with the Prometheus Operator release-0.4 branch. The compatibility matrix between Prometheus Operator (kube-prometheus) and Kubernetes versions is published at https://github.com/prometheus-operator/kube-prometheus.
Note: Prometheus Operator or kube-prometheus? kube-prometheus is a complete monitoring stack deployed by means of the Prometheus Operator; it was split out of the prometheus-operator repository into its own project.
For the deployment process you can refer to 超级小豆丁's notes: http://www.mydlq.club/article/10/. The usual Prometheus architecture diagram shows up in everyone's write-ups, so it is not repeated here.
For now the plan is a basic deployment of Prometheus Operator (a.k.a. kube-prometheus) plus WeChat alerting; everything else can be explored gradually in production.
The basic steps: deploy Prometheus Operator, add persistent storage, add WeChat alerting, and expose the web UIs externally through Traefik.
1. Setting up the Prometheus environment
1. Clone the kube-prometheus repository
git clone https://github.com/prometheus-operator/kube-prometheus.git
Because of network problems the clone often fails, so just download the zip archive instead. According to the compatibility matrix, Kubernetes 1.20 can use kube-prometheus release-0.6, release-0.7, or HEAD; to save effort I simply used HEAD.
Record the tag/commit you deployed, so that future changes and version upgrades can be applied quickly.
Upload the zip archive and extract it:
unzip kube-prometheus-main.zip
2. Run through the quickstart
cd kube-prometheus-main/
kubectl create -f manifests/setup
until kubectl get servicemonitors --all-namespaces ; do date; sleep 1; echo ""; done
kubectl create -f manifests/
kubectl get pods -n monitoring
3. ImagePullBackOff
Because of network restrictions some images cannot be pulled. The usual fix is to pull them on a server outside the wall, retag them, push them to Harbor (or another private registry), and change the image references in the YAML files to the private registry tags. (My private registry is on Tencent Cloud, and the personal edition apparently no longer allows cross-region pushes, so I exported the image with docker save instead.)
kubectl describe pods kube-state-metrics-56f988c7b6-qxqjn -n monitoring
1. Pull the image on a server abroad, save it as a tar archive, and download it locally.
docker pull k8s.gcr.io/kube-state-metrics/kube-state-metrics:v2.0.0-rc.0
docker save k8s.gcr.io/kube-state-metrics/kube-state-metrics:v2.0.0-rc.0 -o kube-state-metrics.tar
2. Import the image with ctr
ctr -n k8s.io i import kube-state-metrics.tar
This only imports the image onto one worker node, but Kubernetes is supposed to provide high availability: what if the pod gets rescheduled onto another node? Add node affinity? What if that node dies? Import the image on every node? What about nodes that join later? Better to push it to an image registry properly.
The proper workflow should look something like this:
crictl images
ctr image tag k8s.gcr.io/kube-state-metrics/kube-state-metrics:v2.0.0-rc.0 ccr.ccs.tencentyun.com/k8s_containers/kube-state-metrics:v2.0.0-rc.0
But why does it say not found? It is probably not a tag-format problem: ctr works on the default containerd namespace unless told otherwise, and the image was imported with -n k8s.io, so the tag command also needs -n k8s.io or it will not see the image. Anyway, tag it and push it to the registry; for the exact commands see https://blog.csdn.net/tongzidane/article/details/114587138 and https://blog.csdn.net/liumiaocn/article/details/103320426/
(Pushing to my registry still failed with a permission problem, even though pulls from the same registry work fine, which left me confused, so I gave up for now and just imported the image directly.)
Either way, the goal is simply to get the kube-state-metrics-XXXX pod running. When there is time I should study the ctr and crictl commands properly; they are still a bit confusing. A rough cheat sheet follows.
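A sketch of the commands involved, assuming containerd as the runtime (the registry path is my own and the push credentials are placeholders):
# crictl talks to the kubelet's CRI endpoint and is mostly for inspection
crictl images                                  # list images visible to the kubelet
crictl pull <image>                            # pull an image through the CRI
# ctr talks to containerd directly; Kubernetes images live in the k8s.io namespace,
# so without -n k8s.io they show up as "not found"
ctr -n k8s.io images ls
ctr -n k8s.io images import kube-state-metrics.tar
ctr -n k8s.io images tag k8s.gcr.io/kube-state-metrics/kube-state-metrics:v2.0.0-rc.0 ccr.ccs.tencentyun.com/k8s_containers/kube-state-metrics:v2.0.0-rc.0
ctr -n k8s.io images push --user <user>:<password> ccr.ccs.tencentyun.com/k8s_containers/kube-state-metrics:v2.0.0-rc.0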
3. Verify that all the services started normally
kubectl get pod -n monitoring
kubectl get svc -n monitoring
4. Expose the applications through Traefik
Note: see the earlier post on installing Traefik on Kubernetes 1.20.5 on Tencent Cloud: https://www.yuque.com/duiniwukenaihe/ehb02i/odflm7#WT4ab. I am used to the IngressRoute approach, so I keep it here instead of a plain Ingress.
cat monitoring.com.yaml
apiVersion: traefik.containo.us/v1alpha1
kind: IngressRoute
metadata:
  namespace: monitoring
  name: alertmanager-main-http
spec:
  entryPoints:
  - web
  routes:
  - match: Host(`alertmanager.saynaihe.com`)
    kind: Rule
    services:
    - name: alertmanager-main
      port: 9093
---
apiVersion: traefik.containo.us/v1alpha1
kind: IngressRoute
metadata:
  namespace: monitoring
  name: grafana-http
spec:
  entryPoints:
  - web
  routes:
  - match: Host(`monitoring.saynaihe.com`)
    kind: Rule
    services:
    - name: grafana
      port: 3000
---
apiVersion: traefik.containo.us/v1alpha1
kind: IngressRoute
metadata:
  namespace: monitoring
  name: prometheus
spec:
  entryPoints:
  - web
  routes:
  - match: Host(`prometheus.saynaihe.com`)
    kind: Rule
    services:
    - name: prometheus-k8s
      port: 9090
kubectl apply -f monitoring.com.yaml
Verify that the Traefik routes work:
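A quick check from any machine that can reach the Traefik entrypoint (a sketch; <traefik-ip> is a placeholder for the Traefik node or load-balancer address, and an HTTP 200 or 302 from each route is the expected result):
curl -s -o /dev/null -w "%{http_code}\n" -H "Host: prometheus.saynaihe.com"   http://<traefik-ip>/
curl -s -o /dev/null -w "%{http_code}\n" -H "Host: alertmanager.saynaihe.com" http://<traefik-ip>/
curl -s -o /dev/null -w "%{http_code}\n" -H "Host: monitoring.saynaihe.com"   http://<traefik-ip>/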
Change the default password.
Just a quick demo for now; it will need to be changed properly later.
This is for demonstration only; later, at a minimum, the Alertmanager and Prometheus web UIs should be put behind basic authentication. A sketch of what that could look like with Traefik follows.
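A minimal sketch of such a setup, assuming the Traefik v2 CRDs installed earlier (the middleware name, secret name, and credentials are placeholders; htpasswd comes from apache2-utils or httpd-tools):
htpasswd -cb ./users admin 'CHANGE_ME'
kubectl create secret generic monitoring-basic-auth-secret --from-file=users=./users -n monitoring

apiVersion: traefik.containo.us/v1alpha1
kind: Middleware
metadata:
  name: monitoring-basic-auth
  namespace: monitoring
spec:
  basicAuth:
    secret: monitoring-basic-auth-secret

Then reference the middleware from the Prometheus and Alertmanager routes in the IngressRoute above, for example:
  routes:
  - match: Host(`prometheus.saynaihe.com`)
    kind: Rule
    middlewares:
    - name: monitoring-basic-auth
    services:
    - name: prometheus-k8s
      port: 9090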
5. Add monitoring for kube-controller-manager and kube-scheduler
On the https://prometheus.saynaihe.com/targets page you can see that, just like in earlier versions, there are still no targets for kube-scheduler and kube-controller-manager.
On every master node, edit kube-controller-manager.yaml and kube-scheduler.yaml under /etc/kubernetes/manifests/ and change - --bind-address=127.0.0.1 to - --bind-address=0.0.0.0.
These are static pod manifests, so the controller-manager and scheduler pods restart automatically after the change; wait for them to come back and then verify, for example as shown below.
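A minimal check on a master node (a sketch; <master-ip> is a placeholder, 10257/10259 are the default secure metrics ports, and even a 401/403 response already proves the ports now listen on all interfaces rather than only 127.0.0.1):
ss -lntp | grep -E '10257|10259'
curl -sk https://<master-ip>:10257/healthz
curl -sk https://<master-ip>:10259/healthz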
In the manifests directory (pay close attention at this step: the matchLabels have changed in the newer version):
grep -A2 -B2 selector kubernetes-serviceMonitor*
cat <<EOF > kube-controller-manager-scheduler.yml
apiVersion: v1
kind: Service
metadata:
  namespace: kube-system
  name: kube-controller-manager
  labels:
    app.kubernetes.io/name: kube-controller-manager
spec:
  selector:
    component: kube-controller-manager
  type: ClusterIP
  clusterIP: None
  ports:
  - name: https-metrics
    port: 10257
    targetPort: 10257
    protocol: TCP
---
apiVersion: v1
kind: Service
metadata:
  namespace: kube-system
  name: kube-scheduler
  labels:
    app.kubernetes.io/name: kube-scheduler
spec:
  selector:
    component: kube-scheduler
  type: ClusterIP
  clusterIP: None
  ports:
  - name: https-metrics
    port: 10259
    targetPort: 10259
    protocol: TCP
EOF
kubectl apply -f kube-controller-manager-scheduler.yml
cat <<EOF > kube-ep.yml
apiVersion: v1
kind: Endpoints
metadata:
  labels:
    k8s-app: kube-controller-manager
  name: kube-controller-manager
  namespace: kube-system
subsets:
- addresses:
  - ip: 10.3.2.5
  - ip: 10.3.2.13
  - ip: 10.3.2.16
  ports:
  - name: https-metrics
    port: 10257
    protocol: TCP
---
apiVersion: v1
kind: Endpoints
metadata:
  labels:
    k8s-app: kube-scheduler
  name: kube-scheduler
  namespace: kube-system
subsets:
- addresses:
  - ip: 10.3.2.5
  - ip: 10.3.2.13
  - ip: 10.3.2.16
  ports:
  - name: https-metrics
    port: 10259
    protocol: TCP
EOF
kubectl apply -f kube-ep.yml
Open https://prometheus.saynaihe.com/targets to verify.
6. etcd monitoring
kubectl -n monitoring create secret generic etcd-certs --from-file=/etc/kubernetes/pki/etcd/healthcheck-client.crt --from-file=/etc/kubernetes/pki/etcd/healthcheck-client.key --from-file=/etc/kubernetes/pki/etcd/ca.crt
kubectl edit prometheus k8s -n monitoring
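The point of the kubectl edit is to mount the new secret into the Prometheus pods; roughly this fragment under spec (the same field is also set permanently in prometheus-prometheus.yaml in step 7.2, so the edit is just the quick way to get it in place now):
spec:
  # ...existing fields unchanged...
  secrets:
  - etcd-certs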
Verify that Prometheus mounts the certificates correctly:
[root@sh-master-02 yaml]# kubectl exec -it prometheus-k8s-0 /bin/sh -n monitoring
kubectl exec [POD] [COMMAND] is DEPRECATED and will be removed in a future version. Use kubectl exec [POD] -- [COMMAND] instead.
Defaulting container name to prometheus.
Use 'kubectl describe pod/prometheus-k8s-0 -n monitoring' to see all of the containers in this pod.
/prometheus $ ls /etc/prometheus/secrets/etcd-certs/
ca.crt                  healthcheck-client.crt  healthcheck-client.key
cat <<EOF > kube-ep-etcd.yml
apiVersion: v1
kind: Service
metadata:
  name: etcd-k8s
  namespace: kube-system
  labels:
    k8s-app: etcd
spec:
  type: ClusterIP
  clusterIP: None
  ports:
  - name: etcd
    port: 2379
    protocol: TCP
---
apiVersion: v1
kind: Endpoints
metadata:
  labels:
    k8s-app: etcd
  name: etcd-k8s
  namespace: kube-system
subsets:
- addresses:
  - ip: 10.3.2.5
  - ip: 10.3.2.13
  - ip: 10.3.2.16
  ports:
  - name: etcd
    port: 2379
    protocol: TCP
EOF
kubectl apply -f kube-ep-etcd.yml
cat <<EOF > prometheus-serviceMonitorEtcd.yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: etcd-k8s
  namespace: monitoring
  labels:
    k8s-app: etcd
spec:
  jobLabel: k8s-app
  endpoints:
  - port: etcd
    interval: 30s
    scheme: https
    tlsConfig:
      caFile: /etc/prometheus/secrets/etcd-certs/ca.crt
      certFile: /etc/prometheus/secrets/etcd-certs/healthcheck-client.crt
      keyFile: /etc/prometheus/secrets/etcd-certs/healthcheck-client.key
      insecureSkipVerify: true
  selector:
    matchLabels:
      k8s-app: etcd
  namespaceSelector:
    matchNames:
    - kube-system
EOF
kubectl apply -f prometheus-serviceMonitorEtcd.yaml
7. Finalize the Prometheus configuration
1. Add an additional scrape (auto-discovery) configuration
Copied more or less verbatim from an example found online:
cat <<'EOF' > prometheus-additional.yaml
- job_name: 'kubernetes-endpoints'
  kubernetes_sd_configs:
  - role: endpoints
  relabel_configs:
  - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scrape]
    action: keep
    regex: true
  - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scheme]
    action: replace
    target_label: __scheme__
    regex: (https?)
  - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_path]
    action: replace
    target_label: __metrics_path__
    regex: (.+)
  - source_labels: [__address__, __meta_kubernetes_service_annotation_prometheus_io_port]
    action: replace
    target_label: __address__
    regex: ([^:]+)(?::\d+)?;(\d+)
    replacement: $1:$2
  - action: labelmap
    regex: __meta_kubernetes_service_label_(.+)
  - source_labels: [__meta_kubernetes_namespace]
    action: replace
    target_label: kubernetes_namespace
  - source_labels: [__meta_kubernetes_service_name]
    action: replace
    target_label: kubernetes_name
  - source_labels: [__meta_kubernetes_pod_name]
    action: replace
    target_label: kubernetes_pod_name
EOF
kubectl create secret generic additional-configs --from-file=prometheus-additional.yaml -n monitoring
(Note the quoted 'EOF': without it the shell would expand the $1:$2 in the replacement field and corrupt the config.)
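For a Service to be picked up by this job it has to carry the prometheus.io/* annotations the relabel rules key on. A hypothetical example (the application name, port, and metrics path are placeholders):
apiVersion: v1
kind: Service
metadata:
  name: my-app
  namespace: default
  annotations:
    prometheus.io/scrape: "true"
    prometheus.io/scheme: "http"
    prometheus.io/path: "/metrics"
    prometheus.io/port: "8080"
spec:
  selector:
    app: my-app
  ports:
  - name: http
    port: 8080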
2. Add storage, retention time, and the etcd secret
cat <<EOF > prometheus-prometheus.yaml
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  labels:
    app.kubernetes.io/component: prometheus
    app.kubernetes.io/name: prometheus
    app.kubernetes.io/part-of: kube-prometheus
    app.kubernetes.io/version: 2.25.0
    prometheus: k8s
  name: k8s
  namespace: monitoring
spec:
  alerting:
    alertmanagers:
    - apiVersion: v2
      name: alertmanager-main
      namespace: monitoring
      port: web
  externalLabels: {}
  image: quay.io/prometheus/prometheus:v2.25.0
  nodeSelector:
    kubernetes.io/os: linux
  podMetadata:
    labels:
      app.kubernetes.io/component: prometheus
      app.kubernetes.io/name: prometheus
      app.kubernetes.io/part-of: kube-prometheus
      app.kubernetes.io/version: 2.25.0
  podMonitorNamespaceSelector: {}
  podMonitorSelector: {}
  probeNamespaceSelector: {}
  probeSelector: {}
  replicas: 2
  resources:
    requests:
      memory: 400Mi
  ruleSelector:
    matchLabels:
      prometheus: k8s
      role: alert-rules
  secrets:
  - etcd-certs
  securityContext:
    fsGroup: 2000
    runAsNonRoot: true
    runAsUser: 1000
  additionalScrapeConfigs:
    name: additional-configs
    key: prometheus-additional.yaml
  serviceAccountName: prometheus-k8s
  retention: 60d
  serviceMonitorNamespaceSelector: {}
  serviceMonitorSelector: {}
  version: 2.25.0
  storage:
    volumeClaimTemplate:
      spec:
        storageClassName: cbs-csi
        resources:
          requests:
            storage: 50Gi
EOF
kubectl apply -f prometheus-prometheus.yaml
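A quick sanity check after applying (the operator normally creates PVCs named prometheus-k8s-db-prometheus-k8s-0/-1; adjust if yours differ):
kubectl get pvc -n monitoring
kubectl get pod -n monitoring | grep prometheus-k8s
kubectl get secret additional-configs -n monitoring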
8. Add persistent storage to Grafana
Create a PVC for Grafana:
cat <<EOF > grafana-pv.yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: grafana
  namespace: monitoring
spec:
  storageClassName: cbs-csi
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 20Gi
EOF
kubectl apply -f grafana-pv.yaml
Modify the storage section of grafana-deployment.yaml in the manifests directory; the relevant change is roughly as follows.
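A minimal sketch of the change, assuming the upstream manifest still defines the grafana-storage volume as an emptyDir; swap it for the PVC created above:
      volumes:
      - name: grafana-storage
        persistentVolumeClaim:
          claimName: grafana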
9. Add Grafana dashboards
Add the etcd and Traefik dashboards by importing dashboard IDs 10906 and 3070. You will find that the Traefik dashboard complains: Panel plugin not found: grafana-piechart-panel.
Fix: rebuild the Grafana image and install the missing plugin with /usr/share/grafana/bin/grafana-cli plugins install grafana-piechart-panel.
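A minimal sketch of such a rebuild (the base image tag and registry path are placeholders; match them to the Grafana version used in your manifests):
# Dockerfile
FROM grafana/grafana:7.5.4
RUN grafana-cli plugins install grafana-piechart-panel

docker build -t ccr.ccs.tencentyun.com/XXXXX/grafana:7.5.4-piechart .
docker push ccr.ccs.tencentyun.com/XXXXX/grafana:7.5.4-piechart
Then point the image field in grafana-deployment.yaml at the new tag. Alternatively, the official Grafana image can also install plugins at startup through the GF_INSTALL_PLUGINS environment variable.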
10. WeChat alerting
Fill the corresponding WeChat Work credentials into alertmanager.yaml.
1. Configure alertmanager.yaml
cat <<EOF > alertmanager.yaml
global:
  resolve_timeout: 2m
  wechat_api_url: 'https://qyapi.weixin.qq.com/cgi-bin/'
route:
  group_by: ['alert']
  group_wait: 10s
  group_interval: 1m
  repeat_interval: 1h
  receiver: wechat
receivers:
- name: 'wechat'
  wechat_configs:
  - api_secret: 'XXXXXXXXXX'
    send_resolved: true
    to_user: '@all'
    to_party: 'XXXXXX'
    agent_id: 'XXXXXXXX'
    corp_id: 'XXXXXXXX'
templates:
- '/etc/config/alert/wechat.tmpl'
inhibit_rules:
- source_match:
    severity: 'critical'
  target_match:
    severity: 'warning'
  equal: ['alertname', 'production', 'instance']
EOF
2. Customize the alert template; this part is up to you, and there are plenty of examples online.
cat <<'EOF' > wechat.tmpl
{{ define "wechat.default.message" }}
{{- if gt (len .Alerts.Firing) 0 -}}
{{- range $index, $alert := .Alerts -}}
{{- if eq $index 0 }}
==========异常告警==========
告警类型: {{ $alert.Labels.alertname }}
告警级别: {{ $alert.Labels.severity }}
告警详情: {{ $alert.Annotations.message }}{{ $alert.Annotations.description}};{{$alert.Annotations.summary}}
故障时间: {{ ($alert.StartsAt.Add 28800e9).Format "2006-01-02 15:04:05" }}
{{- if gt (len $alert.Labels.instance) 0 }}
实例信息: {{ $alert.Labels.instance }}
{{- end }}
{{- if gt (len $alert.Labels.namespace) 0 }}
命名空间: {{ $alert.Labels.namespace }}
{{- end }}
{{- if gt (len $alert.Labels.node) 0 }}
节点信息: {{ $alert.Labels.node }}
{{- end }}
{{- if gt (len $alert.Labels.pod) 0 }}
实例名称: {{ $alert.Labels.pod }}
{{- end }}
============END============
{{- end }}
{{- end }}
{{- end }}
{{- if gt (len .Alerts.Resolved) 0 -}}
{{- range $index, $alert := .Alerts -}}
{{- if eq $index 0 }}
==========异常恢复==========
告警类型: {{ $alert.Labels.alertname }}
告警级别: {{ $alert.Labels.severity }}
告警详情: {{ $alert.Annotations.message }}{{ $alert.Annotations.description}};{{$alert.Annotations.summary}}
故障时间: {{ ($alert.StartsAt.Add 28800e9).Format "2006-01-02 15:04:05" }}
恢复时间: {{ ($alert.EndsAt.Add 28800e9).Format "2006-01-02 15:04:05" }}
{{- if gt (len $alert.Labels.instance) 0 }}
实例信息: {{ $alert.Labels.instance }}
{{- end }}
{{- if gt (len $alert.Labels.namespace) 0 }}
命名空间: {{ $alert.Labels.namespace }}
{{- end }}
{{- if gt (len $alert.Labels.node) 0 }}
节点信息: {{ $alert.Labels.node }}
{{- end }}
{{- if gt (len $alert.Labels.pod) 0 }}
实例名称: {{ $alert.Labels.pod }}
{{- end }}
============END============
{{- end }}
{{- end }}
{{- end }}
{{- end }}
EOF
(Note the quoted 'EOF' and the file name: without the quotes the shell would expand template variables such as $alert and $index, and the file has to be called wechat.tmpl so that the secret created in the next step and the templates path above can find it.)
3. Deploy the secret
kubectl delete secret alertmanager-main -n monitoring
kubectl create secret generic alertmanager-main --from-file=alertmanager.yaml --from-file=wechat.tmpl -n monitoring
4. Verify. One way to do this without waiting for a real alert is sketched below.
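A sketch of a manual end-to-end test: push a hand-crafted alert to the Alertmanager API and wait for the WeChat message (the alert name and labels are arbitrary test values, and the URL assumes the Traefik route from step 4):
curl -XPOST http://alertmanager.saynaihe.com/api/v2/alerts \
  -H "Content-Type: application/json" \
  -d '[{"labels":{"alertname":"WeChatTest","severity":"warning","namespace":"monitoring"},"annotations":{"summary":"test alert from curl"}}]'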
11. Bonus
I happened to want to try out the Kubernetes HPA, and ran into this:
[root@sh-master-02 yaml]# kubectl top pods -n qa
W0330 16:00:54.657335 2622645 top_pod.go:265] Metrics not available for pod qa/dataloader-comment-5d975d9d57-p22w9, age: 2h3m13.657327145s
error: Metrics not available for pod qa/dataloader-comment-5d975d9d57-p22w9, age: 2h3m13.657327145s
What? Doesn't the Prometheus Operator stack ship prometheus-adapter for the metrics API? What is going on?
kubectl logs -f prometheus-adapter-c96488cdd-vfm7h -n monitoring
As the adapter logs showed, I had changed the cluster dnsDomain when installing Kubernetes but never updated the adapter's configuration, which still pointed at the default cluster.local domain, and that breaks it.
In the manifests directory, modify the prometheus-url argument in prometheus-adapter-deployment.yaml; roughly the following change.
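A sketch of the relevant container args (<your-dnsDomain> stands for whatever dnsDomain you set at install time; the other args stay as shipped):
        args:
        # ...other args unchanged; the default URL ends in svc.cluster.local...
        - --prometheus-url=http://prometheus-k8s.monitoring.svc.<your-dnsDomain>:9090/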
After that, kubectl top nodes works.
12. A quick word on HPA
Following https://blog.csdn.net/weixin_38320674/article/details/105460033. The metrics API is already available in this environment, so start from step 7 of that tutorial.
1. Build the test image and push it to the registry; a sketch of the image contents follows.
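The image is the php-apache example from the standard Kubernetes HPA walkthrough, which the referenced tutorial also uses; index.php simply burns CPU on every request:
# Dockerfile
FROM php:5-apache
COPY index.php /var/www/html/index.php
RUN chmod a+rx index.php

# index.php
<?php
  $x = 0.0001;
  for ($i = 0; $i <= 1000000; $i++) {
    $x += sqrt($x);
  }
  echo "OK!";
?>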
docker build -t ccr.ccs.tencentyun.com/XXXXX/test1:0.1 .
docker push ccr.ccs.tencentyun.com/XXXXX/test1:0.1
2. Deploy a php-apache service with a Deployment
cat php-apache.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: php-apache
spec:
  selector:
    matchLabels:
      run: php-apache
  replicas: 1
  template:
    metadata:
      labels:
        run: php-apache
    spec:
      containers:
      - name: php-apache
        image: ccr.ccs.tencentyun.com/XXXXX/test1:0.1
        ports:
        - containerPort: 80
        resources:
          limits:
            cpu: 200m
          requests:
            cpu: 100m
---
apiVersion: v1
kind: Service
metadata:
  name: php-apache
  labels:
    run: php-apache
spec:
  ports:
  - port: 80
  selector:
    run: php-apache
kubectl apply -f php-apache.yaml
3. Create the HPA
kubectl autoscale deployment php-apache --cpu-percent=50 --min=1 --max=10
The flags explained:
kubectl autoscale deployment php-apache (php-apache is the Deployment name) --cpu-percent=50 (target average CPU utilization of 50%) --min=1 (at least 1 pod) --max=10 (at most 10 pods). The declarative equivalent is sketched below.
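The same autoscaler expressed as a manifest, in case you prefer to keep it in git (a sketch using the autoscaling/v1 API, which is what the autoscale command creates):
apiVersion: autoscaling/v1
kind: HorizontalPodAutoscaler
metadata:
  name: php-apache
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: php-apache
  minReplicas: 1
  maxReplicas: 10
  targetCPUUtilizationPercentage: 50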
4. Load-test the php-apache service (CPU only)
Start a container that sends an endless query loop to the php-apache service (duplicate the terminal on the k8s master node, i.e. open a new terminal window):
kubectl run v1 -it --image=busybox /bin/sh
Once inside the container, run:
while true; do wget -q -O- http://php-apache.default; done
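In another terminal, watch the HPA and the Deployment react; the replica count should climb once CPU utilization passes the 50% target:
kubectl get hpa php-apache -w
kubectl get deployment php-apache -w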
Only CPU was tested here; it is a simple demo. The other metrics will be covered separately.
13. Other gotchas
I accidentally deleted the PV and PVC, thinking my StorageClass was at fault. Redeploy, then? I assumed re-applying prometheus-prometheus.yaml would be enough, but the Prometheus pods never came back. Going through the logs showed that the secret from step 7.1 was missing; re-creating it fixed everything. I no longer remember where exactly I deleted that secret, so noting it down here:
kubectl create secret generic additional-configs --from-file=prometheus-additional.yaml -n monitoring