关于运维:Grafana-系列文章十四Helm-安装Loki

60次阅读

共计 5617 个字符,预计需要花费 15 分钟才能阅读完成。

前言

写或者翻译这么多篇 Loki 相干的文章了, 发现还没写怎么装置 😓

当初开始介绍如何应用 Helm 装置 Loki.

前提

有 Helm, 并且增加 Grafana 的官网源:

helm repo add grafana https://grafana.github.io/helm-charts
helm repo update

🐾Warning:

网络受限, 须要保障网络通顺.

部署

架构

Promtail(收集) + Loki(存储及解决) + Grafana(展现)

Promtail

  1. 启用 Prometheus Operator Service Monitor 做监控
  2. 减少 external_labelscluster, 以辨认是哪个 K8S 集群;
  3. pipeline_stages 改为 cri, 以对 cri 日志做解决 (因为我的集群用的 Container Runtime 是 CRI, 而 Loki Helm 默认配置是 docker)
  4. 减少对 systemd-journal 的日志收集:
promtail:
  config:
    snippets:
      pipelineStages:
        - cri: {}

  extraArgs: 
    - -client.external-labels=cluster=ctyun
  # systemd-journal 额定配置:
  # Add additional scrape config
  extraScrapeConfigs:
    - job_name: journal
      journal:
        path: /var/log/journal
        max_age: 12h
        labels:
          job: systemd-journal
      relabel_configs:
        - source_labels: ['__journal__systemd_unit']
          target_label: 'unit'
        - source_labels: ['__journal__hostname']
          target_label: 'hostname'

  # Mount journal directory into Promtail pods
  extraVolumes:
    - name: journal
      hostPath:
        path: /var/log/journal

  extraVolumeMounts:
    - name: journal
      mountPath: /var/log/journal
      readOnly: true

Loki

  1. 启用长久化存储
  2. 启用 Prometheus Operator Service Monitor 做监控

    1. 并配置 Loki 相干 Prometheus Rule 做告警
  3. 因为集体集群日志量较小, 适当调大 ingester 相干配置

Grafana

  1. 启用长久化存储
  2. 启用 Prometheus Operator Service Monitor 做监控
  3. sidecar 都配置上, 不便动静更新 dashboards/datasources/plugins/notifiers;

Helm 装置

通过如下命令装置:

helm upgrade --install loki --namespace=loki --create-namespace grafana/loki-stack -f values.yaml

自定义 values.yaml 如下:

loki:
  enabled: true
  persistence:
    enabled: true
    storageClassName: local-path
    size: 20Gi
  serviceScheme: https
  user: admin
  password: changit!
  config:
    ingester:
      chunk_idle_period: 1h
      max_chunk_age: 4h
    compactor:
      retention_enabled: true
  serviceMonitor:
    enabled: true
    prometheusRule:
      enabled: true
      rules:
        #  Some examples from https://awesome-prometheus-alerts.grep.to/rules.html#loki
        - alert: LokiProcessTooManyRestarts
          expr: changes(process_start_time_seconds{job=~"loki"}[15m]) > 2
          for: 0m
          labels:
            severity: warning
          annotations:
            summary: Loki process too many restarts (instance {{ $labels.instance}})
            description: "A loki process had too many restarts (target {{ $labels.instance}})\n  VALUE = {{$value}}\n  LABELS = {{$labels}}"
        - alert: LokiRequestErrors
          expr: 100 * sum(rate(loki_request_duration_seconds_count{status_code=~"5.."}[1m])) by (namespace, job, route) / sum(rate(loki_request_duration_seconds_count[1m])) by (namespace, job, route) > 10
          for: 15m
          labels:
            severity: critical
          annotations:
            summary: Loki request errors (instance {{ $labels.instance}})
            description: "The {{$labels.job}} and {{$labels.route}} are experiencing errors\n  VALUE = {{$value}}\n  LABELS = {{$labels}}"
        - alert: LokiRequestPanic
          expr: sum(increase(loki_panic_total[10m])) by (namespace, job) > 0
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: Loki request panic (instance {{ $labels.instance}})
            description: "The {{$labels.job}} is experiencing {{printf \"%.2f\"$value}}% increase of panics\n  VALUE = {{$value}}\n  LABELS = {{$labels}}"
        - alert: LokiRequestLatency
          expr: (histogram_quantile(0.99, sum(rate(loki_request_duration_seconds_bucket{route!~"(?i).*tail.*"}[5m])) by (le)))  > 1
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: Loki request latency (instance {{ $labels.instance}})
            description: "The {{$labels.job}} {{$labels.route}} is experiencing {{printf \"%.2f\"$value}}s 99th percentile latency\n  VALUE = {{$value}}\n  LABELS = {{$labels}}"

promtail:
  enabled: true
  config:
    snippets:
      pipelineStages:
        - cri: {}  
  extraArgs:
    - -client.external-labels=cluster=ctyun        
  serviceMonitor:
    # -- If enabled, ServiceMonitor resources for Prometheus Operator are created
    enabled: true

  # systemd-journal 额定配置:
  # Add additional scrape config
  extraScrapeConfigs:
    - job_name: journal
      journal:
        path: /var/log/journal
        max_age: 12h
        labels:
          job: systemd-journal
      relabel_configs:
        - source_labels: ['__journal__systemd_unit']
          target_label: 'unit'
        - source_labels: ['__journal__hostname']
          target_label: 'hostname'

  # Mount journal directory into Promtail pods
  extraVolumes:
    - name: journal
      hostPath:
        path: /var/log/journal

  extraVolumeMounts:
    - name: journal
      mountPath: /var/log/journal
      readOnly: true

fluent-bit:
  enabled: false

grafana:
  enabled: true
  adminUser: caseycui
  adminPassword: changit!
  ## Sidecars that collect the configmaps with specified label and stores the included files them into the respective folders
  ## Requires at least Grafana 5 to work and can't be used together with parameters dashboardProviders, datasources and dashboards
  sidecar:
    image:
      repository: quay.io/kiwigrid/k8s-sidecar
      tag: 1.15.6
      sha: ''
    dashboards:
      enabled: true
      SCProvider: true
      label: grafana_dashboard
    datasources:
      enabled: true
      # label that the configmaps with datasources are marked with
      label: grafana_datasource
    plugins:
      enabled: true
      # label that the configmaps with plugins are marked with
      label: grafana_plugin
    notifiers:
      enabled: true
      # label that the configmaps with notifiers are marked with
      label: grafana_notifier
  image:
    tag: 8.3.5
  persistence:
    enabled: true
    size: 2Gi
    storageClassName: local-path
  serviceMonitor:
    enabled: true
  imageRenderer:
    enabled: disable

filebeat:
  enabled: false

logstash:
  enabled: false

装置后的资源拓扑如下:

Day 2 配置 (按需)

Grafana 减少 Dashboards

在同一个 NS 下, 创立如下 ConfigMap: (只有打上 grafana_dashboard 这个 label 就会被 Grafana 的 sidecar 主动导入 )

apiVersion: v1
kind: ConfigMap
metadata:
  name: sample-grafana-dashboard
  labels:
     grafana_dashboard: "1"
data:
  k8s-dashboard.json: |-
  [...]

Grafana 减少 DataSource

在同一个 NS 下, 创立如下 ConfigMap: (只有打上 grafana_datasource 这个 label 就会被 Grafana 的 sidecar 主动导入 )

apiVersion: v1
kind: ConfigMap
metadata:
  name: loki-loki-stack
  labels:
    grafana_datasource: '1'
data:
  loki-stack-datasource.yaml: |-
    apiVersion: 1
    datasources:
    - name: Loki
      type: loki
      access: proxy
      url: http://loki:3100
      version: 1

Traefik 配置 Grafana IngressRoute

因为我是用的 Traefik 2, 通过 CRD IngressRoute 配置 Ingress, 配置如下:

apiVersion: traefik.containo.us/v1alpha1
kind: IngressRoute
metadata:
  name: grafana
spec:
  entryPoints:
    - web
    - websecure
  routes:
    - kind: Rule
      match: Host(`grafana.ewhisper.cn`)
      middlewares:
        - name: hsts-header
          namespace: kube-system
        - name: redirectshttps
          namespace: kube-system
      services:
        - name: loki-grafana
          namespace: monitoring
          port: 80
  tls: {}

最终成果

如下:

🎉🎉🎉

📚️参考文档

  • helm-charts/charts at main · grafana/helm-charts (github.com)

Grafana 系列文章

Grafana 系列文章

三人行, 必有我师; 常识共享, 天下为公. 本文由东风微鸣技术博客 EWhisper.cn 编写.

正文完
 0