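Abstract: An ES (Elasticsearch) cluster is a powerful tool for big-data storage, analysis, and fast retrieval. This article outlines the ES cluster architecture and gives a sample for quickly deploying an ES cluster on Kubernetes. It then introduces monitoring and operations tools for ES clusters, shares some troubleshooting experience, and finally summarizes commonly used ES cluster API calls.
This article is shared from the Huawei Cloud community post 《Kubernetes中部署ES集群及运维》 ("Deploying and Operating an ES Cluster on Kubernetes"); original author: minucas.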
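ES cluster architecture:
ES can be deployed in single-node mode or in cluster mode. Single-node mode is generally not recommended for production; cluster mode is recommended. Cluster mode comes in two variants: one in which the same nodes act as both Master and Data nodes, and one in which the Master and Data roles are carried by different nodes. Separating Master nodes from Data nodes is the more reliable option. The figure below shows the ES cluster deployment architecture:
Deploying the ES cluster with K8s:
1. Deploy with k8s StatefulSets, which makes it quick to scale ES nodes out and in. This example uses 3 Master Nodes + 12 Data Nodes.
2. A k8s Service provides the corresponding domain name and service discovery, so that the cluster nodes can find each other automatically and be monitored.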
kubectl -s http://ip:port create -f es-master.yaml
kubectl -s http://ip:port create -f es-data.yaml
kubectl -s http://ip:port create -f es-service.yaml
es-master.yaml:
apiVersion: apps/v1
kind: StatefulSet
metadata:
  labels:
    addonmanager.kubernetes.io/mode: Reconcile
    k8s-app: es
    kubernetes.io/cluster-service: "true"
    version: v6.2.5
  name: es-master
  namespace: default
spec:
  podManagementPolicy: OrderedReady
  replicas: 3
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      k8s-app: es
      version: v6.2.5
  serviceName: es
  template:
    metadata:
      labels:
        # must match spec.selector.matchLabels and the Service selector
        k8s-app: es
        kubernetes.io/cluster-service: "true"
        version: v6.2.5
    spec:
      containers:
      - env:
        - name: NAMESPACE
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: metadata.namespace
        - name: ELASTICSEARCH_SERVICE_NAME
          value: es
        - name: NODE_MASTER
          value: "true"
        - name: NODE_DATA
          value: "false"
        - name: ES_HEAP_SIZE
          value: 4g
        - name: ES_JAVA_OPTS
          value: -Xmx4g -Xms4g
        - name: cluster.name
          value: es
        image: elasticsearch:v6.2.5
        imagePullPolicy: Always
        name: es
        ports:
        - containerPort: 9200
          hostPort: 9200
          name: db
          protocol: TCP
        - containerPort: 9300
          hostPort: 9300
          name: transport
          protocol: TCP
        resources:
          limits:
            cpu: "6"
            memory: 12Gi
          requests:
            cpu: "4"
            memory: 8Gi
        securityContext:
          capabilities:
            add:
            - IPC_LOCK
            - SYS_RESOURCE
        volumeMounts:
        - mountPath: /data
          name: es
      - command:
        - /bin/elasticsearch_exporter
        - -es.uri=http://localhost:9200
        - -es.all=true
        image: elasticsearch_exporter:1.0.2
        imagePullPolicy: IfNotPresent
        livenessProbe:
          failureThreshold: 3
          httpGet:
            path: /health
            port: 9108
            scheme: HTTP
          initialDelaySeconds: 30
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 10
        name: es-exporter
        ports:
        - containerPort: 9108
          hostPort: 9108
          protocol: TCP
        readinessProbe:
          failureThreshold: 3
          httpGet:
            path: /health
            port: 9108
            scheme: HTTP
          initialDelaySeconds: 10
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 10
        resources:
          limits:
            cpu: 100m
            memory: 128Mi
          requests:
            cpu: 25m
            memory: 64Mi
        securityContext:
          capabilities:
            drop:
            - SETPCAP
            - MKNOD
            - AUDIT_WRITE
            - CHOWN
            - NET_RAW
            - DAC_OVERRIDE
            - FOWNER
            - FSETID
            - KILL
            - SETGID
            - SETUID
            - NET_BIND_SERVICE
            - SYS_CHROOT
            - SETFCAP
          readOnlyRootFilesystem: true
      dnsPolicy: ClusterFirst
      initContainers:
      - command:
        - /sbin/sysctl
        - -w
        - vm.max_map_count=262144
        image: alpine:3.6
        imagePullPolicy: IfNotPresent
        name: elasticsearch-logging-init
        resources: {}
        securityContext:
          privileged: true
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext: {}
      volumes:
      - hostPath:
          path: /Data/es
          type: DirectoryOrCreate
        name: es
es-data.yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  labels:
    addonmanager.kubernetes.io/mode: Reconcile
    k8s-app: es
    kubernetes.io/cluster-service: "true"
    version: v6.2.5
  name: es-data
  namespace: default
spec:
  podManagementPolicy: OrderedReady
  replicas: 12
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      k8s-app: es
      version: v6.2.5
  serviceName: es
  template:
    metadata:
      labels:
        k8s-app: es
        kubernetes.io/cluster-service: "true"
        version: v6.2.5
    spec:
      containers:
      - env:
        - name: NAMESPACE
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: metadata.namespace
        - name: ELASTICSEARCH_SERVICE_NAME
          value: es
        - name: NODE_MASTER
          value: "false"
        - name: NODE_DATA
          value: "true"
        - name: ES_HEAP_SIZE
          value: 16g
        - name: ES_JAVA_OPTS
          value: -Xmx16g -Xms16g
        - name: cluster.name
          value: es
        image: elasticsearch:v6.2.5
        imagePullPolicy: Always
        name: es
        ports:
        - containerPort: 9200
          hostPort: 9200
          name: db
          protocol: TCP
        - containerPort: 9300
          hostPort: 9300
          name: transport
          protocol: TCP
        resources:
          limits:
            cpu: "8"
            memory: 32Gi
          requests:
            cpu: "7"
            memory: 30Gi
        securityContext:
          capabilities:
            add:
            - IPC_LOCK
            - SYS_RESOURCE
        volumeMounts:
        - mountPath: /data
          name: es
      - command:
        - /bin/elasticsearch_exporter
        - -es.uri=http://localhost:9200
        - -es.all=true
        image: elasticsearch_exporter:1.0.2
        imagePullPolicy: IfNotPresent
        livenessProbe:
          failureThreshold: 3
          httpGet:
            path: /health
            port: 9108
            scheme: HTTP
          initialDelaySeconds: 30
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 10
        name: es-exporter
        ports:
        - containerPort: 9108
          hostPort: 9108
          protocol: TCP
        readinessProbe:
          failureThreshold: 3
          httpGet:
            path: /health
            port: 9108
            scheme: HTTP
          initialDelaySeconds: 10
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 10
        resources:
          limits:
            cpu: 100m
            memory: 128Mi
          requests:
            cpu: 25m
            memory: 64Mi
        securityContext:
          capabilities:
            drop:
            - SETPCAP
            - MKNOD
            - AUDIT_WRITE
            - CHOWN
            - NET_RAW
            - DAC_OVERRIDE
            - FOWNER
            - FSETID
            - KILL
            - SETGID
            - SETUID
            - NET_BIND_SERVICE
            - SYS_CHROOT
            - SETFCAP
          readOnlyRootFilesystem: true
      dnsPolicy: ClusterFirst
      initContainers:
      - command:
        - /sbin/sysctl
        - -w
        - vm.max_map_count=262144
        image: alpine:3.6
        imagePullPolicy: IfNotPresent
        name: elasticsearch-logging-init
        resources: {}
        securityContext:
          privileged: true
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext: {}
      volumes:
      - hostPath:
          path: /Data/es
          type: DirectoryOrCreate
        name: es
es-service.yaml
apiVersion: v1
kind: Service
metadata:
  labels:
    addonmanager.kubernetes.io/mode: Reconcile
    k8s-app: es
    kubernetes.io/cluster-service: "true"
    kubernetes.io/name: Elasticsearch
  name: es
  namespace: default
spec:
  clusterIP: None
  ports:
  - name: es
    port: 9200
    protocol: TCP
    targetPort: 9200
  - name: exporter
    port: 9108
    protocol: TCP
    targetPort: 9108
  selector:
    k8s-app: es
  sessionAffinity: None
  type: ClusterIP
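Because the Service is headless (clusterIP: None) and both StatefulSets set serviceName: es, each pod should get a stable DNS name. The following Python sketch lists the names that would result from the manifests above. The helper pod_dns_names is hypothetical, and the cluster domain cluster.local is an assumed default that may differ in a given cluster.

```python
def pod_dns_names(statefulset, replicas, service="es", namespace="default",
                  cluster_domain="cluster.local"):
    """Return the per-pod DNS names Kubernetes gives StatefulSet pods
    behind a headless Service: <pod>.<service>.<namespace>.svc.<domain>."""
    return [
        f"{statefulset}-{ordinal}.{service}.{namespace}.svc.{cluster_domain}"
        for ordinal in range(replicas)
    ]


# Values taken from the manifests above: 3 masters, 12 data nodes.
masters = pod_dns_names("es-master", 3)
data_nodes = pod_dns_names("es-data", 12)

for name in masters:
    print(name)
```

These names stay the same when a pod is rescheduled, which is what lets the ES nodes find each other through the Service.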
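ES cluster monitoring
As the saying goes, a craftsman must first sharpen his tools: operating middleware starts with adequate monitoring. Three tools are commonly used to monitor an ES cluster: exporter, eshead, and kopf. Because the ES cluster is deployed on k8s, many of these capabilities are used in combination with k8s.
Grafana monitoring
The es-exporter deployed through k8s exports the monitoring metrics, Prometheus scrapes them, and Grafana presents them in custom dashboards.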
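To make the exporter step concrete, here is a small Python sketch that parses metrics in the Prometheus text format and reads the cluster color from them. The sample lines are illustrative rather than captured from a real cluster, and the metric name elasticsearch_cluster_health_status with its color label is an assumption about what this exporter version emits. The parser is deliberately simplified and ignores optional timestamps.

```python
import re

# Illustrative sample only; a real scrape would come from the exporter's
# metrics endpoint (port 9108 in the manifests above).
SAMPLE = """\
# TYPE elasticsearch_cluster_health_status gauge
elasticsearch_cluster_health_status{cluster="es",color="green"} 1
elasticsearch_cluster_health_status{cluster="es",color="red"} 0
elasticsearch_cluster_health_status{cluster="es",color="yellow"} 0
"""

LINE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(?P<labels>.*)\})?\s+(?P<value>\S+)$'
)
LABEL = re.compile(r'(\w+)="([^"]*)"')


def parse_metrics(text):
    """Parse Prometheus text exposition lines into (name, labels, value)."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = LINE.match(line)
        if match:
            labels = dict(LABEL.findall(match.group("labels") or ""))
            samples.append((match.group("name"), labels, float(match.group("value"))))
    return samples


def cluster_color(samples):
    """Return the color whose health gauge is set to 1, or None."""
    for name, labels, value in samples:
        if name == "elasticsearch_cluster_health_status" and value == 1.0:
            return labels.get("color")
    return None


print(cluster_color(parse_metrics(SAMPLE)))
```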
ES-head component
GitHub: https://github.com/mobz/elast...
ES-head can be found and installed from the Chrome Web Store; the Chrome extension lets you inspect the state of the ES cluster.
Cerebro (kopf) component
GitHub: https://github.com/lmenezes/c...
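ES cluster troubleshooting
ES configuration
Resource configuration: watch the ES CPU, memory, and heap size (the Xms and Xmx settings). On a machine with 8 CPU cores and 32 GB of memory, for example, setting the heap (Xms and Xmx) to 50% of memory is recommended. The official guidance is that a single node should not have more than 64 GB of memory.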
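The 50% rule can be written as a small calculation. This Python sketch derives the heap environment variables used in the manifests above from a node's memory size. The function name heap_settings is hypothetical, and treating the 64 GB guidance as a hard error is a simplification made for this example.

```python
def heap_settings(node_memory_gb, heap_fraction=0.5, max_node_memory_gb=64):
    """Derive ES heap settings from node memory using the 50% rule."""
    if node_memory_gb > max_node_memory_gb:
        raise ValueError(
            f"node memory {node_memory_gb} GB exceeds the {max_node_memory_gb} GB guidance"
        )
    heap_gb = int(node_memory_gb * heap_fraction)
    return {
        "ES_HEAP_SIZE": f"{heap_gb}g",
        # Xmx and Xms are set to the same value, as in the manifests above.
        "ES_JAVA_OPTS": f"-Xmx{heap_gb}g -Xms{heap_gb}g",
    }


# An 8-core / 32 GB machine, as in the example in the text.
print(heap_settings(32))
```

For a 32 GB node this gives a 16 GB heap, which matches the values in es-data.yaml.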
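Index configuration: ES locates data through indices and, at query time, loads the relevant index data into memory to speed up retrieval, so sensible index settings have a large effect on ES performance. We currently create indices by date (indices with small data volumes generally do not need to be split).
ES load
Pay particular attention to nodes with high CPU usage and load. A likely cause is uneven shard allocation; in that case the unbalanced shards can be relocated manually.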
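One way to relocate a shard by hand is the move command of the _cluster/reroute API. The Python sketch below only builds the request body. The helper name move_shard_command, the shard number, and the node names es-data-0 and es-data-5 are illustrative assumptions, not values from a real cluster.

```python
import json


def move_shard_command(index, shard, from_node, to_node):
    """Build a _cluster/reroute body that moves one shard between nodes."""
    return {
        "commands": [
            {
                "move": {
                    "index": index,
                    "shard": shard,
                    "from_node": from_node,
                    "to_node": to_node,
                }
            }
        ]
    }


# Hypothetical example: move shard 3 off a hot node.
body = json.dumps(
    move_shard_command("szv-prod-ingress-nginx-2021.05.01", 3, "es-data-0", "es-data-5"),
    indent=2,
)
print(body)
```

The printed JSON would be sent as the body of POST _cluster/reroute.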
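Shard configuration
The shard count is best set to an integer multiple of the number of data nodes. More shards is not always better: shard each index according to its data volume, and make sure that no shard exceeds the heap size allocated to a single data node. For example, our largest index holds about 150 GB per day and is split into 24 shards, which works out to roughly 6-7 GB per shard.
A replica count of 1 is recommended. Too many replicas tends to cause frequent data relocation and increases the load on the cluster.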
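The sizing rules above can be checked with a few lines of arithmetic. This Python sketch uses the figures from the text (150 GB per day, 24 shards, 12 data nodes, 16 GB heap per data node); the function name shard_plan is hypothetical.

```python
def shard_plan(daily_index_gb, shards, data_nodes, heap_gb):
    """Check a shard count against the rules of thumb in the text."""
    shard_size_gb = daily_index_gb / shards
    return {
        "shard_size_gb": shard_size_gb,
        # Shard count should be an integer multiple of the data node count.
        "multiple_of_data_nodes": shards % data_nodes == 0,
        # No shard should exceed the heap of a single data node.
        "fits_in_heap": shard_size_gb <= heap_gb,
    }


plan = shard_plan(daily_index_gb=150, shards=24, data_nodes=12, heap_gb=16)
print(plan)
```

With these inputs each shard is 150 / 24 = 6.25 GB, in line with the 6-7 GB quoted above, and 24 is twice the number of data nodes.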
Deleting an abnormal index
curl -X DELETE "10.64.xxx.xx:9200/szv-prod-ingress-nginx-2021.05.01"
The index name can contain wildcards to delete indices in bulk, e.g. a pattern ending in -2021.05.*
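When a wildcard feels too broad, an explicit list of daily index names can be generated and reviewed before deletion. This Python sketch assumes daily indices follow the prefix-YYYY.MM.DD pattern of the example above; the helper name daily_index_names is hypothetical.

```python
from datetime import date, timedelta


def daily_index_names(prefix, first_day, last_day):
    """List daily index names between two dates, inclusive."""
    names = []
    day = first_day
    while day <= last_day:
        names.append(f"{prefix}{day.strftime('%Y.%m.%d')}")
        day += timedelta(days=1)
    return names


# All daily indices for May 2021, using the prefix from the example above.
may_indices = daily_index_names(
    "szv-prod-ingress-nginx-", date(2021, 5, 1), date(2021, 5, 31)
)
for name in may_indices[:3]:
    print(name)
```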
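Another cause of high node load
While troubleshooting, we found a node whose data shards had already been moved away but whose load stayed high. Logging in to the node and running top showed that the kubelet's CPU usage was very high. Restarting the kubelet did not help; the load only eased after the node itself was restarted.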
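Routine ES cluster operations: lessons learned (with reference to the official documentation)
Checking cluster health
An ES cluster has three health states: Green, Yellow, and Red.
- Green: the cluster is healthy;
- Yellow: the cluster is not fully healthy (all primary shards are assigned, but some replicas are not), yet it can recover automatically by rebalancing as long as the load allows;
- Red: the cluster has a problem and some data is not available; at least one primary shard has not been assigned.
The cluster health status and the number of unassigned shards can be queried through the API: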
GET _cluster/health
{
  "cluster_name": "camp-es",
  "status": "green",
  "timed_out": false,
  "number_of_nodes": 15,
  "number_of_data_nodes": 12,
  "active_primary_shards": 2176,
  "active_shards": 4347,
  "relocating_shards": 0,
  "initializing_shards": 0,
  "unassigned_shards": 0,
  "delayed_unassigned_shards": 0,
  "number_of_pending_tasks": 0,
  "number_of_in_flight_fetch": 0,
  "task_max_waiting_in_queue_millis": 0,
  "active_shards_percent_as_number": 100
}
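A health response like the one above is easy to check from a script. The Python sketch below takes the parsed JSON and returns a list of findings. The function name health_problems and the choice of which counters to flag are assumptions made for illustration.

```python
def health_problems(health):
    """Return human-readable findings for a _cluster/health response."""
    problems = []
    if health["status"] != "green":
        problems.append(f"status is {health['status']}")
    # Flag shard counters that are non-zero.
    for key in ("unassigned_shards", "initializing_shards", "relocating_shards"):
        if health.get(key, 0) > 0:
            problems.append(f"{key}={health[key]}")
    return problems


# Fields copied from the sample response above.
sample = {
    "cluster_name": "camp-es",
    "status": "green",
    "relocating_shards": 0,
    "initializing_shards": 0,
    "unassigned_shards": 0,
}
print(health_problems(sample))
```

An empty list means nothing needs attention; a yellow or red cluster would return one entry per finding.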
Viewing pending tasks:
GET /_cat/pending_tasks
The priority field in the output indicates the priority of each task.
Finding out why a shard is unassigned
GET _cluster/allocation/explain
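The reason field indicates which kind of cause left the shard unassigned, and the detail field gives the specific reason.
Listing all indices that have unassigned primary shards (red health):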
GET /_cat/indices?v&health=red
Finding which shards are abnormal
curl -s http://ip:port/_cat/shards | grep UNASSIGNED
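The same filtering can be done in a script, which makes it easy to separate primaries from replicas. This Python sketch assumes the output was requested with the columns index, shard, prirep, and state (for example through the h query parameter of the cat API). The sample rows are made up for illustration, and the helper name unassigned_shards is hypothetical.

```python
# Illustrative rows in the order: index shard prirep state
SAMPLE = """\
szv-prod-ingress-nginx-2021.05.01 0 p STARTED
szv-prod-ingress-nginx-2021.05.01 0 r UNASSIGNED
szv-prod-ingress-nginx-2021.05.01 1 p UNASSIGNED
"""


def unassigned_shards(cat_output):
    """Extract UNASSIGNED shards from whitespace-separated _cat/shards rows."""
    found = []
    for line in cat_output.splitlines():
        parts = line.split()
        if len(parts) >= 4 and parts[3] == "UNASSIGNED":
            found.append(
                {"index": parts[0], "shard": int(parts[1]), "primary": parts[2] == "p"}
            )
    return found


shards = unassigned_shards(SAMPLE)
primaries = [s for s in shards if s["primary"]]
print(len(shards), len(primaries))
```

Unassigned primaries are the ones that turn the cluster red, so they are usually handled first.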
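Reallocating a primary shard:
POST _cluster/reroute?pretty
{
  "commands": [
    {
      "allocate_stale_primary": {
        "index": "xxx",
        "shard": 1,
        "node": "12345...",
        "accept_data_loss": true
      }
    }
  ]
}
Here node is the ID of an ES cluster node, which can be looked up with curl 'ip:port/_nodes/process?pretty'. Note that accept_data_loss has to be set to true explicitly, because allocating a stale copy as primary can lose writes that copy never received.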
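The node lookup and the reroute body can be combined in a script. The Python sketch below assumes the _nodes/process response has a top-level nodes object keyed by node ID, with a name field on each entry. The sample response, node IDs, and helper names are illustrative, not taken from a real cluster.

```python
def node_id_by_name(nodes_response, node_name):
    """Find a node ID in a _nodes/process response by node name."""
    for node_id, info in nodes_response.get("nodes", {}).items():
        if info.get("name") == node_name:
            return node_id
    return None


def allocate_stale_primary(index, shard, node_id):
    """Build the reroute body shown above for one stale primary."""
    return {
        "commands": [
            {
                "allocate_stale_primary": {
                    "index": index,
                    "shard": shard,
                    "node": node_id,
                    "accept_data_loss": True,  # required; data loss is possible
                }
            }
        ]
    }


# Made-up response fragment for illustration.
sample_nodes = {
    "nodes": {
        "aaa111": {"name": "es-data-0"},
        "bbb222": {"name": "es-data-1"},
    }
}
target = node_id_by_name(sample_nodes, "es-data-1")
print(allocate_stale_primary("xxx", 1, target))
```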
Reducing the number of index replicas
PUT /szv_ingress_*/_settings
{
  "index": {
    "number_of_replicas": 1
  }
}