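Abstract: An ES cluster is a powerful tool for big-data storage, analysis, and fast retrieval. This article outlines the ES cluster architecture and provides a sample for quickly deploying an ES cluster in Kubernetes. It also introduces monitoring and operations tools for ES clusters, shares some troubleshooting experience, and finally summarizes commonly used ES cluster API calls.
This article is shared from the Huawei Cloud community post 《Kubernetes 中部署 ES 集群及运维》 (Deploying and Operating an ES Cluster in Kubernetes); original author: minucas.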
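ES cluster architecture:
ES can be deployed in single-node mode or cluster mode. Single-node mode is generally not recommended for production; use cluster mode instead. Cluster mode has two variants: one where the same nodes act as both Master and Data nodes, and one where the Master and Data roles run on separate nodes. Separating Master and Data nodes is the more reliable option. The figure below shows the deployment architecture of the ES cluster: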
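Deploying the ES cluster with K8s:
1. Deploy with k8s StatefulSets, which makes it quick to scale ES nodes out and in. This example uses 3 Master nodes + 12 Data nodes.
2. A k8s Service provides the DNS names and service discovery, so that the cluster nodes can find each other automatically and be monitored.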
kubectl -s http://ip:port create -f es-master.yaml
kubectl -s http://ip:port create -f es-data.yaml
kubectl -s http://ip:port create -f es-service.yaml
es-master.yaml:
apiVersion: apps/v1
kind: StatefulSet
metadata:
  labels:
    addonmanager.kubernetes.io/mode: Reconcile
    k8s-app: es
    kubernetes.io/cluster-service: "true"
    version: v6.2.5
  name: es-master
  namespace: default
spec:
  podManagementPolicy: OrderedReady
  replicas: 3
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      k8s-app: es
      version: v6.2.5
  serviceName: es
  template:
    metadata:
      labels:
        # Must match spec.selector.matchLabels and the Service selector
        k8s-app: es
        kubernetes.io/cluster-service: "true"
        version: v6.2.5
    spec:
      containers:
      - env:
        - name: NAMESPACE
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: metadata.namespace
        - name: ELASTICSEARCH_SERVICE_NAME
          value: es
        # Master-only node: master-eligible, holds no data
        - name: NODE_MASTER
          value: "true"
        - name: NODE_DATA
          value: "false"
        - name: ES_HEAP_SIZE
          value: 4g
        - name: ES_JAVA_OPTS
          value: -Xmx4g -Xms4g
        - name: cluster.name
          value: es
        image: elasticsearch:v6.2.5
        imagePullPolicy: Always
        name: es
        ports:
        - containerPort: 9200
          hostPort: 9200
          name: db
          protocol: TCP
        - containerPort: 9300
          hostPort: 9300
          name: transport
          protocol: TCP
        resources:
          limits:
            cpu: "6"
            memory: 12Gi
          requests:
            cpu: "4"
            memory: 8Gi
        securityContext:
          capabilities:
            add:
            - IPC_LOCK
            - SYS_RESOURCE
        volumeMounts:
        - mountPath: /data
          name: es
      # Sidecar that exposes ES metrics for Prometheus on port 9108
      - command:
        - /bin/elasticsearch_exporter
        - -es.uri=http://localhost:9200
        - -es.all=true
        image: elasticsearch_exporter:1.0.2
        imagePullPolicy: IfNotPresent
        livenessProbe:
          failureThreshold: 3
          httpGet:
            path: /health
            port: 9108
            scheme: HTTP
          initialDelaySeconds: 30
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 10
        name: es-exporter
        ports:
        - containerPort: 9108
          hostPort: 9108
          protocol: TCP
        readinessProbe:
          failureThreshold: 3
          httpGet:
            path: /health
            port: 9108
            scheme: HTTP
          initialDelaySeconds: 10
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 10
        resources:
          limits:
            cpu: 100m
            memory: 128Mi
          requests:
            cpu: 25m
            memory: 64Mi
        securityContext:
          capabilities:
            drop:
            - SETPCAP
            - MKNOD
            - AUDIT_WRITE
            - CHOWN
            - NET_RAW
            - DAC_OVERRIDE
            - FOWNER
            - FSETID
            - KILL
            - SETGID
            - SETUID
            - NET_BIND_SERVICE
            - SYS_CHROOT
            - SETFCAP
          readOnlyRootFilesystem: true
      dnsPolicy: ClusterFirst
      initContainers:
      # Raise vm.max_map_count on the host before ES starts
      - command:
        - /sbin/sysctl
        - -w
        - vm.max_map_count=262144
        image: alpine:3.6
        imagePullPolicy: IfNotPresent
        name: elasticsearch-logging-init
        resources: {}
        securityContext:
          privileged: true
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext: {}
      volumes:
      - hostPath:
          path: /Data/es
          type: DirectoryOrCreate
        name: es
es-data.yaml:
apiVersion: apps/v1
kind: StatefulSet
metadata:
  labels:
    addonmanager.kubernetes.io/mode: Reconcile
    k8s-app: es
    kubernetes.io/cluster-service: "true"
    version: v6.2.5
  name: es-data
  namespace: default
spec:
  podManagementPolicy: OrderedReady
  replicas: 12
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      k8s-app: es
      version: v6.2.5
  serviceName: es
  template:
    metadata:
      labels:
        k8s-app: es
        kubernetes.io/cluster-service: "true"
        version: v6.2.5
    spec:
      containers:
      - env:
        - name: NAMESPACE
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: metadata.namespace
        - name: ELASTICSEARCH_SERVICE_NAME
          value: es
        - name: NODE_MASTER
          value: "false"
        - name: NODE_DATA
          value: "true"
        - name: ES_HEAP_SIZE
          value: 16g
        - name: ES_JAVA_OPTS
          value: -Xmx16g -Xms16g
        - name: cluster.name
          value: es
        image: elasticsearch:v6.2.5
        imagePullPolicy: Always
        name: es
        ports:
        - containerPort: 9200
          hostPort: 9200
          name: db
          protocol: TCP
        - containerPort: 9300
          hostPort: 9300
          name: transport
          protocol: TCP
        resources:
          limits:
            cpu: "8"
            memory: 32Gi
          requests:
            cpu: "7"
            memory: 30Gi
        securityContext:
          capabilities:
            add:
            - IPC_LOCK
            - SYS_RESOURCE
        volumeMounts:
        - mountPath: /data
          name: es
      - command:
        - /bin/elasticsearch_exporter
        - -es.uri=http://localhost:9200
        - -es.all=true
        image: elasticsearch_exporter:1.0.2
        imagePullPolicy: IfNotPresent
        livenessProbe:
          failureThreshold: 3
          httpGet:
            path: /health
            port: 9108
            scheme: HTTP
          initialDelaySeconds: 30
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 10
        name: es-exporter
        ports:
        - containerPort: 9108
          hostPort: 9108
          protocol: TCP
        readinessProbe:
          failureThreshold: 3
          httpGet:
            path: /health
            port: 9108
            scheme: HTTP
          initialDelaySeconds: 10
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 10
        resources:
          limits:
            cpu: 100m
            memory: 128Mi
          requests:
            cpu: 25m
            memory: 64Mi
        securityContext:
          capabilities:
            drop:
            - SETPCAP
            - MKNOD
            - AUDIT_WRITE
            - CHOWN
            - NET_RAW
            - DAC_OVERRIDE
            - FOWNER
            - FSETID
            - KILL
            - SETGID
            - SETUID
            - NET_BIND_SERVICE
            - SYS_CHROOT
            - SETFCAP
          readOnlyRootFilesystem: true
      dnsPolicy: ClusterFirst
      initContainers:
      - command:
        - /sbin/sysctl
        - -w
        - vm.max_map_count=262144
        image: alpine:3.6
        imagePullPolicy: IfNotPresent
        name: elasticsearch-logging-init
        resources: {}
        securityContext:
          privileged: true
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext: {}
      volumes:
      - hostPath:
          path: /Data/es
          type: DirectoryOrCreate
        name: es
es-service.yaml:
apiVersion: v1
kind: Service
metadata:
  labels:
    addonmanager.kubernetes.io/mode: Reconcile
    k8s-app: es
    kubernetes.io/cluster-service: "true"
    kubernetes.io/name: Elasticsearch
  name: es
  namespace: default
spec:
  clusterIP: None
  ports:
  - name: es
    port: 9200
    protocol: TCP
    targetPort: 9200
  - name: exporter
    port: 9108
    protocol: TCP
    targetPort: 9108
  selector:
    k8s-app: es
  sessionAffinity: None
  type: ClusterIP
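Because the Service is headless (clusterIP: None) and both StatefulSets set serviceName: es, each pod should get a stable DNS name of the form pod-name.service.namespace.svc.cluster-domain. The sketch below only builds those names for the replica counts used here; the helper name pod_dns_names is hypothetical, and cluster.local is assumed as the cluster domain (the Kubernetes default, which a given cluster may override).

```python
def pod_dns_names(statefulset, replicas, service="es",
                  namespace="default", cluster_domain="cluster.local"):
    """Return the stable DNS names of a StatefulSet's pods behind a headless Service.

    StatefulSet pods are named <statefulset>-<ordinal>, with ordinals
    starting at 0. cluster_domain is an assumption (Kubernetes default).
    """
    return [
        f"{statefulset}-{i}.{service}.{namespace}.svc.{cluster_domain}"
        for i in range(replicas)
    ]


if __name__ == "__main__":
    for name in pod_dns_names("es-master", 3):
        print(name)
```

These names are what the ES nodes can use to discover one another, which is why the Service is created alongside the two StatefulSets.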
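ES cluster monitoring
As the saying goes, to do a good job you must first sharpen your tools: operating middleware starts with adequate monitoring. Three tools are commonly used to monitor an ES cluster: exporter, eshead, and kopf. Because the ES cluster is deployed on k8s, many of these are used in combination with k8s.
Grafana monitoring
es-exporter, deployed through k8s, exports the monitoring metrics; Prometheus scrapes them, and a custom Grafana dashboard displays them.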
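To make the exporter's role more concrete, here is a minimal sketch of reading the Prometheus text format that an exporter serves on port 9108. The sample payload is illustrative, not captured from a real cluster, and the metric name elasticsearch_cluster_health_status with a color label is an assumption about what elasticsearch_exporter exposes; check the /metrics output of your own exporter version.

```python
import re

# Illustrative payload (assumption): one gauge per color, value 1 for the current status.
SAMPLE_METRICS = """\
# TYPE elasticsearch_cluster_health_status gauge
elasticsearch_cluster_health_status{cluster="es",color="green"} 1
elasticsearch_cluster_health_status{cluster="es",color="red"} 0
elasticsearch_cluster_health_status{cluster="es",color="yellow"} 0
"""

# metric_name{label="value",...} value   (lines with timestamps are not handled)
LINE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(?P<labels>[^}]*)\})?\s+(?P<value>\S+)$'
)
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')


def parse_metrics(text):
    """Parse Prometheus text exposition into (name, labels, value) tuples."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = LINE_RE.match(line)
        if match is None:
            continue  # skip anything this simple parser does not understand
        labels = dict(LABEL_RE.findall(match.group("labels") or ""))
        samples.append((match.group("name"), labels, float(match.group("value"))))
    return samples


def cluster_color(text):
    """Return the color whose health gauge is 1, or None if not found."""
    for name, labels, value in parse_metrics(text):
        if name == "elasticsearch_cluster_health_status" and value == 1.0:
            return labels.get("color")
    return None


if __name__ == "__main__":
    print(cluster_color(SAMPLE_METRICS))
```

In practice Prometheus does this parsing itself; the sketch is only meant to show what kind of data ends up on the Grafana dashboard.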
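ES-head component
GitHub: https://github.com/mobz/elast…
ES-head can be installed by searching for it in the Chrome Web Store; the Chrome extension lets you inspect the state of the ES cluster.
Cerebro (kopf) component
GitHub: https://github.com/lmenezes/c…
ES cluster troubleshooting
ES configuration
Resource configuration: pay attention to the CPU, memory, and heap size (the Xms and Xmx settings) of ES. On a machine with 8 CPU cores and 32 GB of memory, for example, we recommend setting the heap (both Xms and Xmx) to 50% of memory. The official guidance is that a single node should not have more than 64 GB of memory.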
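The 50% rule can be written down as a small calculation. This is a sketch only: the helper name heap_opts is hypothetical, and the 31 GB ceiling is an added assumption based on the common advice to keep the JVM heap below roughly 32 GB so that compressed object pointers stay enabled; the article itself only states the 50% rule.

```python
def heap_opts(node_memory_gb, heap_fraction=0.5, max_heap_gb=31):
    """Build an ES_JAVA_OPTS value from the node's memory.

    heap_fraction=0.5 follows the 50% rule from the text.
    max_heap_gb=31 is an assumption (stay under the ~32 GB compressed-oops limit).
    Xms and Xmx are set to the same value so the heap never resizes.
    """
    heap_gb = min(int(node_memory_gb * heap_fraction), max_heap_gb)
    if heap_gb < 1:
        raise ValueError("node memory too small for a 1 GB heap")
    return f"-Xms{heap_gb}g -Xmx{heap_gb}g"


if __name__ == "__main__":
    print(heap_opts(32))   # the 8-core / 32 GB example from the text
```

For the 32 GB example this yields a 16 GB heap, which is the value used for the data nodes in es-data.yaml.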
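Index configuration: ES locates data through indices, and during a search it loads the relevant index data into memory to speed up retrieval, so sensible index settings have a large effect on ES performance. We currently create one index per date (indices with a small data volume generally do not need to be split).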
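A minimal sketch of the date-based naming, assuming the prefix-YYYY.MM.DD pattern seen in the index name szv-prod-ingress-nginx-2021.05.01 used later in this article. The helper names are hypothetical.

```python
from datetime import date, timedelta


def daily_index_name(prefix, day):
    """Return the index name for one day, e.g. prefix-2021.05.01."""
    return f"{prefix}-{day:%Y.%m.%d}"


def indices_for_range(prefix, start, end):
    """Return the daily index names from start to end, both inclusive."""
    days = (end - start).days
    return [daily_index_name(prefix, start + timedelta(days=i)) for i in range(days + 1)]


if __name__ == "__main__":
    for name in indices_for_range("szv-prod-ingress-nginx", date(2021, 4, 30), date(2021, 5, 2)):
        print(name)
```

Splitting by date keeps each index bounded in size and makes it possible to drop old data by deleting whole indices.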
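ES load
Pay close attention to nodes with high CPU usage and load. A likely cause is uneven shard allocation; in that case you can manually relocate the unevenly distributed shards.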
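A manual relocation is typically done with the move command of the cluster reroute API (POST _cluster/reroute). The sketch below only builds the request body and sends nothing; the node names es-data-3 and es-data-7 are hypothetical placeholders.

```python
import json


def move_shard_command(index, shard, from_node, to_node):
    """Build a _cluster/reroute body that moves one shard between two nodes."""
    return {
        "commands": [
            {
                "move": {
                    "index": index,
                    "shard": shard,
                    "from_node": from_node,
                    "to_node": to_node,
                }
            }
        ]
    }


if __name__ == "__main__":
    # Hypothetical node names; use the names reported by your own cluster.
    body = move_shard_command("szv-prod-ingress-nginx-2021.05.01", 0, "es-data-3", "es-data-7")
    print(json.dumps(body, indent=2))
```

The printed JSON can be used as the body of a POST _cluster/reroute request.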
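Shard configuration
The shard count should preferably be an integer multiple of the number of data nodes. More shards is not always better: shard each index according to its data volume, and make sure no single shard exceeds the heap size allocated to a single data node. For example, our largest index is about 150 GB per day and is split into 24 shards, which works out to roughly 6-7 GB per shard.
We recommend a replica count of 1. Too many replicas tends to cause frequent data relocation and increases cluster load.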
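The shard arithmetic above can be sketched as follows. The helper name shard_plan and the target_shard_gb parameter are hypothetical; the only rules taken from the text are that the shard count should be a multiple of the data node count and that shards should stay reasonably small.

```python
import math


def shard_plan(index_size_gb, data_nodes, target_shard_gb):
    """Pick a shard count that is a multiple of data_nodes.

    The count is the smallest multiple of data_nodes that keeps each
    shard at or below target_shard_gb. Returns (shards, size_per_shard_gb).
    """
    min_shards = math.ceil(index_size_gb / target_shard_gb)
    shards = math.ceil(min_shards / data_nodes) * data_nodes
    return shards, index_size_gb / shards


if __name__ == "__main__":
    shards, size = shard_plan(index_size_gb=150, data_nodes=12, target_shard_gb=7)
    print(shards, size)
```

With 150 GB per day, 12 data nodes, and a 7 GB target, this gives 24 shards of 6.25 GB each, in line with the figures quoted above.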
Deleting a problematic index
curl -X DELETE "10.64.xxx.xx:9200/szv-prod-ingress-nginx-2021.05.01"
Index names support wildcard matching for batch deletion, e.g. -2021.05.*
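Since a wildcard delete cannot be undone, it can help to preview which indices a pattern would hit before running it. The sketch below uses Python's fnmatch as an approximation of ES wildcard matching (fnmatch also gives special meaning to ? and [...], which goes beyond the * wildcard). The index list and the fully written pattern are illustrative; in practice the names would come from GET /_cat/indices.

```python
from fnmatch import fnmatchcase


def preview_matches(pattern, index_names):
    """Return the index names that a wildcard pattern would select, sorted."""
    return sorted(name for name in index_names if fnmatchcase(name, pattern))


if __name__ == "__main__":
    # Illustrative index names, following the daily naming used in this article.
    existing = [
        "szv-prod-ingress-nginx-2021.04.30",
        "szv-prod-ingress-nginx-2021.05.01",
        "szv-prod-ingress-nginx-2021.05.02",
    ]
    for name in preview_matches("szv-prod-ingress-nginx-2021.05.*", existing):
        print(name)
```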
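Another cause of high node load
While troubleshooting, we found that a node's load stayed high even after its data shards had been moved away. Logging in to the node and running top showed that kubelet on the node was using a very large amount of CPU. Restarting kubelet did not help; the load only eased after the node itself was restarted.
Summary of routine ES cluster operations experience (based on the official documentation)
Checking cluster health
An ES cluster has three health states: Green, Yellow, and Red.
- Green: the cluster is healthy;
- Yellow: the cluster is not fully healthy (all primary shards are allocated, but at least one replica is not); within the load it can tolerate, the cluster can recover automatically by rebalancing;
- Red: the cluster has a problem; some data is unavailable and at least one primary shard has not been allocated.
The cluster health and the number of unassigned shards can be queried through the API: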
GET _cluster/health
{
  "cluster_name": "camp-es",
  "status": "green",
  "timed_out": false,
  "number_of_nodes": 15,
  "number_of_data_nodes": 12,
  "active_primary_shards": 2176,
  "active_shards": 4347,
  "relocating_shards": 0,
  "initializing_shards": 0,
  "unassigned_shards": 0,
  "delayed_unassigned_shards": 0,
  "number_of_pending_tasks": 0,
  "number_of_in_flight_fetch": 0,
  "task_max_waiting_in_queue_millis": 0,
  "active_shards_percent_as_number": 100
}
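As a sketch of how this response might be consumed in a script, the snippet below reads a few of the fields shown above and derives the replica shard count. The function name summarize_health and the needs_attention rule are illustrative choices, not part of the ES API.

```python
import json

# A subset of the _cluster/health fields from the response shown above.
HEALTH_JSON = """
{
  "cluster_name": "camp-es",
  "status": "green",
  "number_of_nodes": 15,
  "number_of_data_nodes": 12,
  "active_primary_shards": 2176,
  "active_shards": 4347,
  "unassigned_shards": 0
}
"""


def summarize_health(health):
    """Reduce a _cluster/health response to the fields worth alerting on."""
    return {
        "status": health["status"],
        # active_shards counts primaries and replicas together
        "replica_shards": health["active_shards"] - health["active_primary_shards"],
        "unassigned_shards": health["unassigned_shards"],
        "needs_attention": health["status"] != "green" or health["unassigned_shards"] > 0,
    }


if __name__ == "__main__":
    print(summarize_health(json.loads(HEALTH_JSON)))
```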
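View pending tasks:
GET /_cat/pending_tasks
The priority field indicates the priority of each task.
Find out why a shard is unassigned
GET _cluster/allocation/explain
The reason field indicates what caused the shard to be unassigned, and the detail field gives the specifics.
List all red indices (those with unassigned primary shards):
GET /_cat/indices?v&health=red
Find out which shards are in an abnormal state
curl -s http://ip:port/_cat/shards | grep UNASSIGNED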
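The same filtering can be done in a script when the result needs further processing. The sketch below assumes the default _cat/shards column order (index, shard, prirep, state, ...) and uses an illustrative sample rather than real cluster output.

```python
# Illustrative _cat/shards output (assumption: default columns, no header row).
SAMPLE_CAT_SHARDS = """\
szv-prod-ingress-nginx-2021.05.01 0 p STARTED 1200 5.9gb 10.0.0.1 es-data-0
szv-prod-ingress-nginx-2021.05.01 1 p UNASSIGNED
szv-prod-ingress-nginx-2021.05.01 1 r UNASSIGNED
"""


def unassigned_shards(cat_output):
    """Return (index, shard_number, is_primary) for every UNASSIGNED shard."""
    result = []
    for line in cat_output.splitlines():
        cols = line.split()
        if len(cols) < 4 or cols[3] != "UNASSIGNED":
            continue
        # prirep column: "p" for primary, "r" for replica
        result.append((cols[0], int(cols[1]), cols[2] == "p"))
    return result


if __name__ == "__main__":
    for index, shard, is_primary in unassigned_shards(SAMPLE_CAT_SHARDS):
        kind = "primary" if is_primary else "replica"
        print(f"{index} shard {shard} ({kind}) is unassigned")
```

Unassigned primaries are the ones that turn the cluster red, so separating them from replicas helps decide what to fix first.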
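Reallocate a primary shard (accept_data_loss must be set to true, because allocating a stale primary can lose data):
POST _cluster/reroute?pretty
{
  "commands": [
    {
      "allocate_stale_primary": {
        "index": "xxx",
        "shard": 1,
        "node": "12345...",
        "accept_data_loss": true
      }
    }
  ]
}
Here node is the ID of an ES cluster node, which can be looked up with curl 'ip:port/_nodes/process?pretty'.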
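The node lookup can be sketched as below. It assumes the nodes info response keeps its nodes under a top-level "nodes" object keyed by node ID, each entry carrying a "name" field; the sample response and its IDs are made up for illustration.

```python
# Illustrative, heavily trimmed nodes info response (IDs are hypothetical).
SAMPLE_NODES_RESPONSE = {
    "cluster_name": "es",
    "nodes": {
        "hypothetical-node-id-1": {"name": "es-data-0"},
        "hypothetical-node-id-2": {"name": "es-data-1"},
    },
}


def node_id_by_name(nodes_response, node_name):
    """Return the node ID whose "name" matches node_name, or None."""
    for node_id, info in nodes_response.get("nodes", {}).items():
        if info.get("name") == node_name:
            return node_id
    return None


if __name__ == "__main__":
    print(node_id_by_name(SAMPLE_NODES_RESPONSE, "es-data-1"))
```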
Reduce the number of index replicas
PUT /szv_ingress_*/_settings
{
  "index": {"number_of_replicas": 1}
}