Operations: Grafana Series (14): Installing Loki with Helm

I have written or translated quite a few articles about Loki, and only now noticed that I never covered how to install it 😓

This article walks through installing Loki with Helm.

Prerequisites: Helm is installed, and the official Grafana chart repository has been added:

helm repo add grafana https://grafana.github.io/helm-charts
helm repo update

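🐾 Warning:

Access to the chart repository may be restricted on some networks; make sure the connection works before continuing.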

The deployment architecture is: Promtail (collection) + Loki (storage and processing) + Grafana (visualization).

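Promtail is configured as follows:
- Enable the Prometheus Operator ServiceMonitor for monitoring.
- Add the external label cluster, to identify which K8s cluster the logs come from.
- Change pipeline_stages to cri so that CRI-format logs are parsed (my cluster uses a CRI container runtime, whereas the Loki Helm chart defaults to docker).
- Add log collection for systemd-journal: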

promtail:
  config:
    snippets:
      pipelineStages:
        - cri: {}

  extraArgs: 
    - -client.external-labels=cluster=ctyun
  # Extra configuration for systemd-journal:
  # Add additional scrape config
  extraScrapeConfigs:
    - job_name: journal
      journal:
        path: /var/log/journal
        max_age: 12h
        labels:
          job: systemd-journal
      relabel_configs:
        - source_labels: ['__journal__systemd_unit']
          target_label: 'unit'
        - source_labels: ['__journal__hostname']
          target_label: 'hostname'

  # Mount journal directory into Promtail pods
  extraVolumes:
    - name: journal
      hostPath:
        path: /var/log/journal

  extraVolumeMounts:
    - name: journal
      mountPath: /var/log/journal
      readOnly: true
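
For readers wondering what the `cri` stage in the snippet above actually does: as far as I understand the Promtail documentation, it is shorthand for a small pipeline that splits each CRI log line into timestamp, stream, flags and content. The sketch below spells out that equivalent pipeline. Treat the exact regular expression as an assumption and check it against the documentation for your Promtail version; it is not part of the original article.

```yaml
# Sketch of the pipeline that "- cri: {}" is roughly equivalent to.
# A CRI log line looks like: <time> <stdout|stderr> <flags> <content>
pipelineStages:
  # Split the raw line into named capture groups.
  - regex:
      expression: "^(?s)(?P<time>\\S+?) (?P<stream>stdout|stderr) (?P<flags>\\S+?) (?P<content>.*)$"
  # Promote the stream (stdout/stderr) to a Loki label.
  - labels:
      stream:
  # Use the timestamp written by the container runtime as the entry time.
  - timestamp:
      source: time
      format: RFC3339Nano
  # Keep only the application's own output as the log line.
  - output:
      source: content
```

In practice the one-line `cri: {}` form is preferable; the expanded form is only useful for understanding or customizing the parsing.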

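Loki is configured as follows:
- Enable persistent storage.
- Enable the Prometheus Operator ServiceMonitor for monitoring, and configure Loki-related PrometheusRule alerts.
- Because my personal cluster has a low log volume, raise the ingester settings (chunk_idle_period and max_chunk_age) accordingly.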

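Grafana is configured as follows:
- Enable persistent storage.
- Enable the Prometheus Operator ServiceMonitor for monitoring.
- Enable all the sidecars, so that dashboards, datasources, plugins and notifiers can be updated dynamically.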

Install with the following command:

helm upgrade --install loki --namespace=loki --create-namespace grafana/loki-stack -f values.yaml

The custom values.yaml is as follows:

loki:
  enabled: true
  persistence:
    enabled: true
    storageClassName: local-path
    size: 20Gi
  serviceScheme: https
  user: admin
  password: changit!
  config:
    ingester:
      chunk_idle_period: 1h
      max_chunk_age: 4h
    compactor:
      retention_enabled: true
  serviceMonitor:
    enabled: true
    prometheusRule:
      enabled: true
      rules:
        #  Some examples from https://awesome-prometheus-alerts.grep.to/rules.html#loki
        - alert: LokiProcessTooManyRestarts
          expr: changes(process_start_time_seconds{job=~"loki"}[15m]) > 2
          for: 0m
          labels:
            severity: warning
          annotations:
            summary: Loki process too many restarts (instance {{ $labels.instance}})
            description: "A loki process had too many restarts (target {{ $labels.instance}})\n  VALUE = {{$value}}\n  LABELS = {{$labels}}"
        - alert: LokiRequestErrors
          expr: 100 * sum(rate(loki_request_duration_seconds_count{status_code=~"5.."}[1m])) by (namespace, job, route) / sum(rate(loki_request_duration_seconds_count[1m])) by (namespace, job, route) > 10
          for: 15m
          labels:
            severity: critical
          annotations:
            summary: Loki request errors (instance {{ $labels.instance}})
            description: "The {{$labels.job}} and {{$labels.route}} are experiencing errors\n  VALUE = {{$value}}\n  LABELS = {{$labels}}"
        - alert: LokiRequestPanic
          expr: sum(increase(loki_panic_total[10m])) by (namespace, job) > 0
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: Loki request panic (instance {{ $labels.instance}})
            description: "The {{$labels.job}} is experiencing {{printf \"%.2f\" $value}}% increase of panics\n  VALUE = {{$value}}\n  LABELS = {{$labels}}"
        - alert: LokiRequestLatency
          expr: (histogram_quantile(0.99, sum(rate(loki_request_duration_seconds_bucket{route!~"(?i).*tail.*"}[5m])) by (le)))  > 1
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: Loki request latency (instance {{ $labels.instance}})
            description: "The {{$labels.job}} {{$labels.route}} is experiencing {{printf \"%.2f\" $value}}s 99th percentile latency\n  VALUE = {{$value}}\n  LABELS = {{$labels}}"

promtail:
  enabled: true
  config:
    snippets:
      pipelineStages:
        - cri: {}  
  extraArgs:
    - -client.external-labels=cluster=ctyun        
  serviceMonitor:
    # -- If enabled, ServiceMonitor resources for Prometheus Operator are created
    enabled: true

  # Extra configuration for systemd-journal:
  # Add additional scrape config
  extraScrapeConfigs:
    - job_name: journal
      journal:
        path: /var/log/journal
        max_age: 12h
        labels:
          job: systemd-journal
      relabel_configs:
        - source_labels: ['__journal__systemd_unit']
          target_label: 'unit'
        - source_labels: ['__journal__hostname']
          target_label: 'hostname'

  # Mount journal directory into Promtail pods
  extraVolumes:
    - name: journal
      hostPath:
        path: /var/log/journal

  extraVolumeMounts:
    - name: journal
      mountPath: /var/log/journal
      readOnly: true

fluent-bit:
  enabled: false

grafana:
  enabled: true
  adminUser: caseycui
  adminPassword: changit!
  ## Sidecars that collect the configmaps with specified label and stores the included files them into the respective folders
  ## Requires at least Grafana 5 to work and can't be used together with parameters dashboardProviders, datasources and dashboards
  sidecar:
    image:
      repository: quay.io/kiwigrid/k8s-sidecar
      tag: 1.15.6
      sha: ''
    dashboards:
      enabled: true
      SCProvider: true
      label: grafana_dashboard
    datasources:
      enabled: true
      # label that the configmaps with datasources are marked with
      label: grafana_datasource
    plugins:
      enabled: true
      # label that the configmaps with plugins are marked with
      label: grafana_plugin
    notifiers:
      enabled: true
      # label that the configmaps with notifiers are marked with
      label: grafana_notifier
  image:
    tag: 8.3.5
  persistence:
    enabled: true
    size: 2Gi
    storageClassName: local-path
  serviceMonitor:
    enabled: true
  imageRenderer:
    enabled: false

filebeat:
  enabled: false

logstash:
  enabled: false
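
One detail worth noting in the values above: `compactor.retention_enabled` is switched on, but no retention period is set, so whatever default your Loki version ships with applies. A minimal sketch of pinning it explicitly is shown below. The `168h` value is purely illustrative and not from the original article, and the fragment assumes the compactor-based retention of Loki 2.x with the chart's default boltdb-shipper storage.

```yaml
# Hypothetical addition to values.yaml: make the retention period explicit.
loki:
  config:
    compactor:
      retention_enabled: true
    limits_config:
      # Assumption: keep logs for 7 days; choose a value that fits the 20Gi volume.
      retention_period: 168h
```

If this fragment is merged into the existing `loki.config` section, the ingester settings shown earlier stay unchanged.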

The resource topology after installation is as follows:

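To add dashboards to Grafana, create a ConfigMap like the following in the same namespace (any ConfigMap carrying the grafana_dashboard label is imported automatically by the Grafana sidecar):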

apiVersion: v1
kind: ConfigMap
metadata:
  name: sample-grafana-dashboard
  labels:
    grafana_dashboard: "1"
data:
  k8s-dashboard.json: |-
    [...]
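
Since the dashboard JSON above is elided, here is a separate, self-contained sketch of what a complete dashboard ConfigMap could look like. Everything in it is hypothetical: the ConfigMap name, the file name, the `uid` and title, and the `loki` namespace (assumed to match the namespace used in the install command). The JSON is an empty placeholder dashboard, not the author's k8s dashboard.

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: hello-grafana-dashboard   # hypothetical name
  namespace: loki                 # assumption: same namespace as the Helm release
  labels:
    grafana_dashboard: "1"        # the label the dashboards sidecar watches for
data:
  hello-dashboard.json: |-
    {
      "uid": "hello-loki",
      "title": "Hello Loki",
      "schemaVersion": 30,
      "version": 1,
      "panels": []
    }
```

Once applied, the sidecar should pick the file up without a Grafana restart; an exported dashboard JSON can replace the placeholder body.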

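To add a datasource to Grafana, create a ConfigMap like the following in the same namespace (any ConfigMap carrying the grafana_datasource label is imported automatically by the Grafana sidecar):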

apiVersion: v1
kind: ConfigMap
metadata:
  name: loki-loki-stack
  labels:
    grafana_datasource: '1'
data:
  loki-stack-datasource.yaml: |-
    apiVersion: 1
    datasources:
    - name: Loki
      type: loki
      access: proxy
      url: http://loki:3100
      version: 1

Because I use Traefik 2, I expose Grafana through the IngressRoute CRD rather than a standard Ingress. The configuration is as follows:

apiVersion: traefik.containo.us/v1alpha1
kind: IngressRoute
metadata:
  name: grafana
spec:
  entryPoints:
    - web
    - websecure
  routes:
    - kind: Rule
      match: Host(`grafana.ewhisper.cn`)
      middlewares:
        - name: hsts-header
          namespace: kube-system
        - name: redirectshttps
          namespace: kube-system
      services:
        - name: loki-grafana
          namespace: monitoring
          port: 80
  tls: {}
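
The IngressRoute above references two middlewares, hsts-header and redirectshttps, in the kube-system namespace, but the article does not show how they are defined. The following is a minimal sketch of what they might look like with the Traefik 2 Middleware CRD. Only the names and namespace come from the IngressRoute; the field values are assumptions.

```yaml
# Assumed definition: redirect plain HTTP requests to HTTPS.
apiVersion: traefik.containo.us/v1alpha1
kind: Middleware
metadata:
  name: redirectshttps
  namespace: kube-system
spec:
  redirectScheme:
    scheme: https
    permanent: true
---
# Assumed definition: add an HSTS (Strict-Transport-Security) response header.
apiVersion: traefik.containo.us/v1alpha1
kind: Middleware
metadata:
  name: hsts-header
  namespace: kube-system
spec:
  headers:
    stsSeconds: 31536000        # illustrative value: one year
    stsIncludeSubdomains: true
    stsPreload: true
```

Both middlewares must exist before the IngressRoute can route traffic correctly, and Traefik must be allowed to reference resources across namespaces.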

The final result looks like this:

🎉🎉🎉

📚️ Reference: helm-charts/charts at main · grafana/helm-charts (github.com)

Grafana series articles

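Among any three people walking together, one can be my teacher; knowledge is to be shared, for the good of all. This article was written by the 东风微鸣 tech blog, EWhisper.cn.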
