Preface

Having written or translated so many Loki-related articles, I realized I still haven't covered how to install it.

So this post walks through installing Loki with Helm.

Prerequisites

Helm must be installed, with the official Grafana chart repository added:

helm repo add grafana https://grafana.github.io/helm-charts
helm repo update

Warning:

If you are behind a restricted network, make sure the chart repository is actually reachable before proceeding.
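To verify the repository was added correctly, a quick search should list the chart (the version shown will differ over time):

helm search repo grafana/loki-stack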

Deployment

Architecture

Promtail (collection) + Loki (storage and processing) + Grafana (visualization)

Promtail

  1. Enable the Prometheus Operator ServiceMonitor for monitoring.
  2. Add an external_labels entry, cluster, to identify which K8s cluster the logs come from.
  3. Switch pipeline_stages to cri so CRI-format logs are parsed correctly (my cluster uses a CRI container runtime, while the Loki Helm chart's default pipeline assumes docker log format); see the log-line example after the snippet below.
  4. Add collection of systemd-journal logs:
promtail:
  config:
    snippets:
      pipelineStages:
        - cri: {}
  extraArgs:
    - -client.external-labels=cluster=ctyun
  # Extra configuration for systemd-journal:
  # Add additional scrape config
  extraScrapeConfigs:
    - job_name: journal
      journal:
        path: /var/log/journal
        max_age: 12h
        labels:
          job: systemd-journal
      relabel_configs:
        - source_labels: ['__journal__systemd_unit']
          target_label: 'unit'
        - source_labels: ['__journal__hostname']
          target_label: 'hostname'
  # Mount journal directory into Promtail pods
  extraVolumes:
    - name: journal
      hostPath:
        path: /var/log/journal
  extraVolumeMounts:
    - name: journal
      mountPath: /var/log/journal
      readOnly: true
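For context on the cri pipeline stage: CRI runtimes write each container log line as <timestamp> <stream> <P|F> <message>, and the cri stage parses out the timestamp, the stream, and the partial/full flag. A made-up example line:

2024-05-01T12:00:00.123456789Z stderr F error: connection refused

The default docker stage expects JSON-encoded lines instead, which is why it fails to parse logs on a CRI runtime.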

Loki

  1. Enable persistent storage.
  2. Enable the Prometheus Operator ServiceMonitor for monitoring.

    1. Also configure Loki-related Prometheus Rules for alerting.
  3. Since my personal cluster carries a small log volume, raise the ingester-related settings (chunk_idle_period, max_chunk_age) so chunks sit in memory longer and flush fuller, instead of being written out small and often.

Grafana

  1. Enable persistent storage.
  2. Enable the Prometheus Operator ServiceMonitor for monitoring.
  3. Enable all the sidecars, so dashboards/datasources/plugins/notifiers can be updated dynamically (see the Day 2 examples below).

Installing with Helm

Install it with the following command:

helm upgrade --install loki --namespace=loki --create-namespace grafana/loki-stack -f values.yaml

My customized values.yaml is as follows:

loki:
  enabled: true
  persistence:
    enabled: true
    storageClassName: local-path
    size: 20Gi
  serviceScheme: https
  user: admin
  password: changit!
  config:
    ingester:
      chunk_idle_period: 1h
      max_chunk_age: 4h
    compactor:
      retention_enabled: true
  serviceMonitor:
    enabled: true
    prometheusRule:
      enabled: true
      rules:
        # Some examples from https://awesome-prometheus-alerts.grep.to/rules.html#loki
        - alert: LokiProcessTooManyRestarts
          expr: changes(process_start_time_seconds{job=~"loki"}[15m]) > 2
          for: 0m
          labels:
            severity: warning
          annotations:
            summary: Loki process too many restarts (instance {{ $labels.instance }})
            description: "A loki process had too many restarts (target {{ $labels.instance }})\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
        - alert: LokiRequestErrors
          expr: 100 * sum(rate(loki_request_duration_seconds_count{status_code=~"5.."}[1m])) by (namespace, job, route) / sum(rate(loki_request_duration_seconds_count[1m])) by (namespace, job, route) > 10
          for: 15m
          labels:
            severity: critical
          annotations:
            summary: Loki request errors (instance {{ $labels.instance }})
            description: "The {{ $labels.job }} and {{ $labels.route }} are experiencing errors\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
        - alert: LokiRequestPanic
          expr: sum(increase(loki_panic_total[10m])) by (namespace, job) > 0
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: Loki request panic (instance {{ $labels.instance }})
            description: "The {{ $labels.job }} is experiencing {{ printf \"%.2f\" $value }}% increase of panics\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
        - alert: LokiRequestLatency
          expr: (histogram_quantile(0.99, sum(rate(loki_request_duration_seconds_bucket{route!~"(?i).*tail.*"}[5m])) by (le))) > 1
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: Loki request latency (instance {{ $labels.instance }})
            description: "The {{ $labels.job }} {{ $labels.route }} is experiencing {{ printf \"%.2f\" $value }}s 99th percentile latency\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

promtail:
  enabled: true
  config:
    snippets:
      pipelineStages:
        - cri: {}
  extraArgs:
    - -client.external-labels=cluster=ctyun
  serviceMonitor:
    # -- If enabled, ServiceMonitor resources for Prometheus Operator are created
    enabled: true
  # Extra configuration for systemd-journal:
  # Add additional scrape config
  extraScrapeConfigs:
    - job_name: journal
      journal:
        path: /var/log/journal
        max_age: 12h
        labels:
          job: systemd-journal
      relabel_configs:
        - source_labels: ['__journal__systemd_unit']
          target_label: 'unit'
        - source_labels: ['__journal__hostname']
          target_label: 'hostname'
  # Mount journal directory into Promtail pods
  extraVolumes:
    - name: journal
      hostPath:
        path: /var/log/journal
  extraVolumeMounts:
    - name: journal
      mountPath: /var/log/journal
      readOnly: true

fluent-bit:
  enabled: false

grafana:
  enabled: true
  adminUser: caseycui
  adminPassword: changit!
  ## Sidecars that collect the configmaps with the specified label and store the included files in the respective folders
  ## Requires at least Grafana 5 to work and can't be used together with parameters dashboardProviders, datasources and dashboards
  sidecar:
    image:
      repository: quay.io/kiwigrid/k8s-sidecar
      tag: 1.15.6
      sha: ''
    dashboards:
      enabled: true
      SCProvider: true
      label: grafana_dashboard
    datasources:
      enabled: true
      # label that the configmaps with datasources are marked with
      label: grafana_datasource
    plugins:
      enabled: true
      # label that the configmaps with plugins are marked with
      label: grafana_plugin
    notifiers:
      enabled: true
      # label that the configmaps with notifiers are marked with
      label: grafana_notifier
  image:
    tag: 8.3.5
  persistence:
    enabled: true
    size: 2Gi
    storageClassName: local-path
  serviceMonitor:
    enabled: true
  imageRenderer:
    enabled: false

filebeat:
  enabled: false

logstash:
  enabled: false
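Re-running the helm upgrade --install command above after editing values.yaml applies changes in place. If you have the third-party helm-diff plugin installed (an optional extra, not part of core Helm), you can preview what an upgrade would change first:

helm diff upgrade loki grafana/loki-stack --namespace=loki -f values.yaml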

The resource topology after installation looks like this:
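To inspect the deployed resources yourself (object names below assume a release named loki, as in the install command above; yours may vary slightly):

# Expect a loki-0 StatefulSet pod, one loki-promtail DaemonSet pod per node,
# and a loki-grafana Deployment pod, all Running
kubectl -n loki get pods,svc,pvc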

Day 2 Configuration (as needed)

Adding Dashboards to Grafana

In the same namespace, create a ConfigMap like the one below (any ConfigMap carrying the grafana_dashboard label is automatically imported by Grafana's sidecar):

apiVersion: v1
kind: ConfigMap
metadata:
  name: sample-grafana-dashboard
  labels:
    grafana_dashboard: "1"
data:
  k8s-dashboard.json: |-
    [...]
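Instead of writing the ConfigMap by hand, you can also generate it from a dashboard JSON exported from Grafana and then attach the label; a sketch, where k8s-dashboard.json is a hypothetical exported file:

kubectl -n loki create configmap sample-grafana-dashboard \
  --from-file=k8s-dashboard.json
kubectl -n loki label configmap sample-grafana-dashboard grafana_dashboard=1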

Adding a DataSource to Grafana

In the same namespace, create a ConfigMap like the one below (any ConfigMap carrying the grafana_datasource label is automatically imported by Grafana's sidecar):

apiVersion: v1
kind: ConfigMap
metadata:
  name: loki-loki-stack
  labels:
    grafana_datasource: '1'
data:
  loki-stack-datasource.yaml: |-
    apiVersion: 1
    datasources:
    - name: Loki
      type: loki
      access: proxy
      url: http://loki:3100
      version: 1
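To confirm the sidecar picked the datasource up, check its logs; grafana-sc-datasources is the container name the Grafana chart typically gives this sidecar, so verify it against your own pod spec first:

kubectl -n loki logs deployment/loki-grafana -c grafana-sc-datasources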

Configuring a Grafana IngressRoute in Traefik

Since I run Traefik 2, the ingress is configured through its IngressRoute CRD, as follows:

apiVersion: traefik.containo.us/v1alpha1
kind: IngressRoute
metadata:
  name: grafana
spec:
  entryPoints:
    - web
    - websecure
  routes:
    - kind: Rule
      match: Host(`grafana.ewhisper.cn`)
      middlewares:
        - name: hsts-header
          namespace: kube-system
        - name: redirectshttps
          namespace: kube-system
      services:
        - name: loki-grafana
          namespace: monitoring
          port: 80
  tls: {}
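Once DNS for grafana.ewhisper.cn resolves to the Traefik entrypoint, a quick smoke test from outside the cluster:

# -k skips certificate verification; expect a 200, or a 302 to the Grafana login page
curl -kIL https://grafana.ewhisper.cn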

Final Result

As shown below:

References

  • helm-charts/charts at main · grafana/helm-charts (github.com)

Grafana Series Articles

Grafana Series Articles

When three people walk together, one of them can surely teach me something; knowledge shared is for the common good. This article was written by 东风微鸣技术博客 EWhisper.cn.