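Preface
Having written or translated so many Loki articles, I realized I had never covered how to install it 😓
So this post shows how to install Loki with Helm.
Prerequisites
Helm is installed, and the official Grafana chart repository has been added: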
helm repo add grafana https://grafana.github.io/helm-charts
helm repo update
🐾Warning:
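Access to this repository may be restricted on some networks; make sure it is reachable before continuing.
Deployment
Architecture
Promtail (collection) + Loki (storage and processing) + Grafana (visualization)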
Promtail
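- Enable the Prometheus Operator ServiceMonitor for monitoring
- Add a cluster entry to external_labels to identify which K8S cluster the logs come from
- Change pipeline_stages to cri so that CRI-format logs are parsed (my cluster uses a CRI container runtime, while the Loki Helm chart defaults to docker)
- Add log collection for systemd-journal: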
promtail:
  config:
    snippets:
      pipelineStages:
        - cri: {}
  extraArgs:
    - -client.external-labels=cluster=ctyun
  # Extra configuration for systemd-journal:
  # Add additional scrape config
  extraScrapeConfigs:
    - job_name: journal
      journal:
        path: /var/log/journal
        max_age: 12h
        labels:
          job: systemd-journal
      relabel_configs:
        - source_labels: ['__journal__systemd_unit']
          target_label: 'unit'
        - source_labels: ['__journal__hostname']
          target_label: 'hostname'
  # Mount journal directory into Promtail pods
  extraVolumes:
    - name: journal
      hostPath:
        path: /var/log/journal
  extraVolumeMounts:
    - name: journal
      mountPath: /var/log/journal
      readOnly: true
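To make the cri setting less of a black box: a CRI log line has the shape "timestamp stream flags message", and the Promtail documentation describes the cri stage as shorthand for a small pipeline roughly like the one below. This is only a sketch of the equivalent stages for reference, not something to add to values.yaml, and newer Promtail versions may do more (for example, joining partial lines).

```yaml
# Approximate expansion of "- cri: {}" (reference only).
- regex:
    # Split the line into time, stream (stdout|stderr), flags and content.
    expression: "^(?s)(?P<time>\\S+?) (?P<stream>stdout|stderr) (?P<flags>\\S+?) (?P<content>.*)$"
- labels:
    # Promote the extracted stream value to a label.
    stream:
- timestamp:
    # Use the runtime's timestamp as the log entry time.
    source: time
    format: RFC3339Nano
- output:
    # Keep only the application message as the log line.
    source: content
```

With the docker default, the same lines would be treated as JSON and fail to parse, which is why the stage has to match the runtime's log format.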
Loki
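- Enable persistent storage
- Enable the Prometheus Operator ServiceMonitor for monitoring
- Configure Loki-related Prometheus rules for alerting
- Because my personal cluster has a small log volume, raise the ingester settings somewhat (chunk_idle_period and max_chunk_age)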
Grafana
- Enable persistent storage
- Enable the Prometheus Operator ServiceMonitor for monitoring
- Enable all the sidecars, so that dashboards, datasources, plugins and notifiers can be updated dynamically
Installing with Helm
Install with the following command:
helm upgrade --install loki --namespace=loki --create-namespace grafana/loki-stack -f values.yaml
The custom values.yaml is as follows:
loki:
  enabled: true
  persistence:
    enabled: true
    storageClassName: local-path
    size: 20Gi
  serviceScheme: https
  user: admin
  password: changit!
  config:
    ingester:
      chunk_idle_period: 1h
      max_chunk_age: 4h
    compactor:
      retention_enabled: true
  serviceMonitor:
    enabled: true
    prometheusRule:
      enabled: true
      rules:
        # Some examples from https://awesome-prometheus-alerts.grep.to/rules.html#loki
        - alert: LokiProcessTooManyRestarts
          expr: changes(process_start_time_seconds{job=~"loki"}[15m]) > 2
          for: 0m
          labels:
            severity: warning
          annotations:
            summary: Loki process too many restarts (instance {{ $labels.instance }})
            description: "A loki process had too many restarts (target {{ $labels.instance }})\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
        - alert: LokiRequestErrors
          expr: 100 * sum(rate(loki_request_duration_seconds_count{status_code=~"5.."}[1m])) by (namespace, job, route) / sum(rate(loki_request_duration_seconds_count[1m])) by (namespace, job, route) > 10
          for: 15m
          labels:
            severity: critical
          annotations:
            summary: Loki request errors (instance {{ $labels.instance }})
            description: "The {{ $labels.job }} and {{ $labels.route }} are experiencing errors\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
        - alert: LokiRequestPanic
          expr: sum(increase(loki_panic_total[10m])) by (namespace, job) > 0
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: Loki request panic (instance {{ $labels.instance }})
            description: "The {{ $labels.job }} is experiencing {{ printf \"%.2f\" $value }}% increase of panics\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
        - alert: LokiRequestLatency
          expr: (histogram_quantile(0.99, sum(rate(loki_request_duration_seconds_bucket{route!~"(?i).*tail.*"}[5m])) by (le))) > 1
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: Loki request latency (instance {{ $labels.instance }})
            description: "The {{ $labels.job }} {{ $labels.route }} is experiencing {{ printf \"%.2f\" $value }}s 99th percentile latency\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
promtail:
  enabled: true
  config:
    snippets:
      pipelineStages:
        - cri: {}
  extraArgs:
    - -client.external-labels=cluster=ctyun
  serviceMonitor:
    # -- If enabled, ServiceMonitor resources for Prometheus Operator are created
    enabled: true
  # Extra configuration for systemd-journal:
  # Add additional scrape config
  extraScrapeConfigs:
    - job_name: journal
      journal:
        path: /var/log/journal
        max_age: 12h
        labels:
          job: systemd-journal
      relabel_configs:
        - source_labels: ['__journal__systemd_unit']
          target_label: 'unit'
        - source_labels: ['__journal__hostname']
          target_label: 'hostname'
  # Mount journal directory into Promtail pods
  extraVolumes:
    - name: journal
      hostPath:
        path: /var/log/journal
  extraVolumeMounts:
    - name: journal
      mountPath: /var/log/journal
      readOnly: true
fluent-bit:
  enabled: false
grafana:
  enabled: true
  adminUser: caseycui
  adminPassword: changit!
  ## Sidecars that collect the configmaps with the specified label and store the included files in the respective folders
  ## Requires at least Grafana 5 to work and can't be used together with parameters dashboardProviders, datasources and dashboards
  sidecar:
    image:
      repository: quay.io/kiwigrid/k8s-sidecar
      tag: 1.15.6
      sha: ''
    dashboards:
      enabled: true
      SCProvider: true
      label: grafana_dashboard
    datasources:
      enabled: true
      # label that the configmaps with datasources are marked with
      label: grafana_datasource
    plugins:
      enabled: true
      # label that the configmaps with plugins are marked with
      label: grafana_plugin
    notifiers:
      enabled: true
      # label that the configmaps with notifiers are marked with
      label: grafana_notifier
  image:
    tag: 8.3.5
  persistence:
    enabled: true
    size: 2Gi
    storageClassName: local-path
  serviceMonitor:
    enabled: true
  imageRenderer:
    enabled: false
filebeat:
  enabled: false
logstash:
  enabled: false
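One thing the values above leave implicit is how long logs are kept: compactor.retention_enabled only switches retention on. As a minimal sketch, assuming Loki 2.x compactor-based retention and that the chart passes loki.config through to Loki unchanged, the period itself would be set under limits_config. The 168h value is a placeholder for illustration, not taken from the author's setup.

```yaml
# Sketch only: merge into the loki section of values.yaml.
loki:
  config:
    compactor:
      retention_enabled: true
    limits_config:
      # Assumed example value (7 days). Loki documents a 24h minimum.
      retention_period: 168h
```

Because Helm merges nested maps in values files, adding limits_config.retention_period this way should leave the chart's other limits_config defaults in place.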
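The resource topology after installation is as follows:
Day 2 configuration (as needed)
Adding Grafana dashboards
In the same namespace, create a ConfigMap like the following (any ConfigMap carrying the grafana_dashboard label is imported automatically by Grafana's sidecar):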
apiVersion: v1
kind: ConfigMap
metadata:
  name: sample-grafana-dashboard
  labels:
    grafana_dashboard: "1"
data:
  k8s-dashboard.json: |-
    [...]
Adding a Grafana data source
In the same namespace, create a ConfigMap like the following (any ConfigMap carrying the grafana_datasource label is imported automatically by Grafana's sidecar):
apiVersion: v1
kind: ConfigMap
metadata:
  name: loki-loki-stack
  labels:
    grafana_datasource: '1'
data:
  loki-stack-datasource.yaml: |-
    apiVersion: 1
    datasources:
      - name: Loki
        type: loki
        access: proxy
        url: http://loki:3100
        version: 1
Configuring a Traefik IngressRoute for Grafana
Because I use Traefik 2, Ingress is configured through the IngressRoute CRD, as follows:
apiVersion: traefik.containo.us/v1alpha1
kind: IngressRoute
metadata:
  name: grafana
spec:
  entryPoints:
    - web
    - websecure
  routes:
    - kind: Rule
      match: Host(`grafana.ewhisper.cn`)
      middlewares:
        - name: hsts-header
          namespace: kube-system
        - name: redirectshttps
          namespace: kube-system
      services:
        - name: loki-grafana
          namespace: monitoring
          port: 80
  tls: {}
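The IngressRoute refers to two middlewares, hsts-header and redirectshttps, in kube-system, but their definitions are not part of this post. A hypothetical sketch of what they might look like in Traefik 2 follows; the names and namespace come from the route above, while the specs and the stsSeconds value are assumptions for illustration.

```yaml
# Assumed definition: redirect plain HTTP requests to HTTPS.
apiVersion: traefik.containo.us/v1alpha1
kind: Middleware
metadata:
  name: redirectshttps
  namespace: kube-system
spec:
  redirectScheme:
    scheme: https
    permanent: true
---
# Assumed definition: add a Strict-Transport-Security header.
apiVersion: traefik.containo.us/v1alpha1
kind: Middleware
metadata:
  name: hsts-header
  namespace: kube-system
spec:
  headers:
    stsSeconds: 31536000        # placeholder: one year
    stsIncludeSubdomains: true
```

If these middlewares do not exist in the cluster, Traefik will not serve the route, so they need to be created before the IngressRoute or removed from it.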
Final result
As shown below:
🎉🎉🎉
📚️References
- helm-charts/charts at main · grafana/helm-charts (github.com)
Grafana article series
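Where three walk together, one of them can be my teacher; knowledge shared belongs to everyone. This article was written by the 东风微鸣 tech blog, EWhisper.cn.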