About TiDB: Chaos Mesh in Practice | Verifying the Stability of GreatDB's Distributed Deployment Mode with Chaos Engineering

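Chaos Mesh was originally created as the testing platform for the open-source distributed database TiDB. It is a versatile chaos engineering platform that verifies the stability of distributed systems through chaos testing. Using the distributed deployment mode of Wanli Secure Database Software GreatDB as an example, this article walks through the full process of chaos testing with Chaos Mesh.

Chaos testing is a very good way to surface the uncertainty in a distributed system and to build confidence in its resilience, so we use the open-source tool Chaos Mesh to chaos-test GreatDB distributed clusters.

Wanli Secure Database Software GreatDB is a relational database that supports both centralized and distributed deployment. This article covers the distributed deployment mode.

The distributed deployment mode uses a shared-nothing architecture. Data redundancy and replica management ensure the database has no single point of failure; data sharding and distributed parallel computing give the database system high performance; and data nodes can be scaled out dynamically without limit to meet business needs.

The overall architecture is shown in the figure below:

Before installing Chaos Mesh, make sure helm and docker are already installed and that a Kubernetes environment is ready.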

Install with Helm

1) Add the Chaos Mesh repository to Helm:

helm repo add chaos-mesh https://charts.chaos-mesh.org

2) List the Chaos Mesh versions available for installation:

helm search repo chaos-mesh

3) Create the namespace in which Chaos Mesh will be installed:

kubectl create ns chaos-testing

4) Install Chaos Mesh in a docker environment:

helm install chaos-mesh chaos-mesh/chaos-mesh -n=chaos-testing

Verify the installation

Run the following command to check that Chaos Mesh is running:

kubectl get pod -n chaos-testing

The expected output is:

NAME                                       READY   STATUS    RESTARTS   AGE

chaos-controller-manager-d7bc9ccb5-dbccq   1/1     Running   0          26d

chaos-daemon-pzxc7                         1/1     Running   0          26d

chaos-dashboard-5887f7559b-kgz46           1/1     Running   1          26d

If all 3 pods are in the Running state, Chaos Mesh has been installed successfully.
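The check above can also be scripted. The following is a minimal sketch, assuming the default table layout of `kubectl get pod` (NAME, READY, STATUS, ...); the function name `all_pods_running` is our own and is not part of Chaos Mesh or kubectl.

```shell
#!/bin/sh
# Hypothetical helper: succeeds only if every pod row read from stdin is
# Running and all of its containers are ready (READY column like "1/1").
all_pods_running() {
  awk '
    NR == 1 && $1 == "NAME" { next }   # skip the header row
    NF == 0 { next }                   # ignore blank lines
    {
      split($2, ready, "/")            # "1/1" -> ready[1]=1, ready[2]=1
      if ($3 != "Running" || ready[1] != ready[2]) {
        print "not ready: " $1 " (" $3 ", " $2 ")"
        bad = 1
      }
      seen++
    }
    # Fail when any pod is not ready, or when no pod rows were seen at all.
    END { if (bad || seen == 0) exit 1 }
  '
}
```

A possible use is `kubectl get pod -n chaos-testing | all_pods_running && echo "Chaos Mesh is up"`.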

In general, MySQL uses the official 5.7 image and the MySQL monitoring collector is mysqld-exporter. Both can be pulled directly from Docker Hub:

docker pull mysql:5.7

docker pull prom/mysqld-exporter

ZooKeeper uses the official 3.5.5 image. The monitoring components for ZooKeeper are jmx-prometheus-exporter and zookeeper-exporter, both pulled from Docker Hub:

docker pull zookeeper:3.5.5

docker pull sscaling/jmx-prometheus-exporter

docker pull josdotso/zookeeper-exporter

Pick a GreatDB tar package and extract it to get a ./greatdb directory, then copy greatdb-service-docker.sh into that extracted ./greatdb directory:

cp greatdb-service-docker.sh ./greatdb/

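Place the greatdb Dockerfile in the directory that contains the ./greatdb folder (at the same level as it), then run the following command to build the GreatDB image: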

docker build -t greatdb/greatdb:tag2021 .

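Download the cluster deployment script cluster-setup, the cluster initialization script init-zk, and the cluster helm charts package (available on request from the 4.0 development / test team).

Put these files in the same directory and write the following Dockerfile: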

FROM debian:buster-slim as init-zk

COPY ./init-zk /root/init-zk
RUN chmod +x /root/init-zk

FROM debian:buster-slim as cluster-setup

# Set aliyun repo for speed
RUN sed -i 's/deb.debian.org/mirrors.aliyun.com/g' /etc/apt/sources.list && \
  sed -i 's/security.debian.org/mirrors.aliyun.com/g' /etc/apt/sources.list

RUN apt-get -y update && \
  apt-get -y install \
  curl \
  wget

RUN curl -L https://storage.googleapis.com/kubernetes-release/release/v1.20.1/bin/linux/amd64/kubectl -o /usr/local/bin/kubectl && \
  chmod +x /usr/local/bin/kubectl && \
  mkdir /root/.kube && \
  wget https://get.helm.sh/helm-v3.5.3-linux-amd64.tar.gz && \
  tar -zxvf helm-v3.5.3-linux-amd64.tar.gz && \
  mv linux-amd64/helm /usr/local/bin/helm

COPY ./config /root/.kube/
COPY ./helm /helm
COPY ./cluster-setup /

Run the following commands to build the required images:

docker build --target init-zk -t greatdb/initzk:latest .


docker build --target cluster-setup -t greatdb/cluster-setup:v1 .

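The test cases currently supported include bank, bank2, pbank, tpcc, flashback, and others. Each test case is a single executable file.

Taking the flashback test case as an example, build the test case image as follows: download the test case locally, then write a Dockerfile with the following content in the same directory as the test case: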

FROM debian:buster-slim

COPY ./flashback /

RUN cd / && chmod +x ./flashback

Run the following command to build the test case image:

docker build -t greatdb/testsuite-flashback:v1 .
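Since every test case is a single executable, the flashback pattern can be repeated for the others. The sketch below only prints a Dockerfile and the build commands; it does not run docker. The image naming `greatdb/testsuite-<case>:v1` and the file naming `Dockerfile.<case>` are assumptions extrapolated from the flashback example, and both function names are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: print a Dockerfile for one test case executable,
# following the same three lines used for flashback above.
render_testsuite_dockerfile() {
  case_name=$1
  cat <<EOF
FROM debian:buster-slim
COPY ./${case_name} /
RUN cd / && chmod +x ./${case_name}
EOF
}

# Hypothetical helper: print (not run) one docker build command per test case.
# Assumes each Dockerfile was saved as Dockerfile.<case> in the same directory.
print_build_commands() {
  for case_name in "$@"; do
    echo "docker build -f Dockerfile.${case_name} -t greatdb/testsuite-${case_name}:v1 ."
  done
}
```

For example, `render_testsuite_dockerfile tpcc > Dockerfile.tpcc` followed by `print_build_commands bank tpcc` shows the commands to review before running them.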

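For how to create a private registry and upload images, see: https://zhuanlan.zhihu.com/p/…

In the cluster-setup directory from section 2.2.4 of the previous chapter, run the following command block to set up the test cluster: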

./cluster-setup \
  -clustername=c0 \
  -namespace=test \
  -enable-monitor=true \
  -mysql-image=mysql:5.7 \
  -mysql-replica=3 \
  -mysql-auth=1 \
  -mysql-normal=1 \
  -mysql-global=1 \
  -mysql-partition=1 \
  -zookeeper-repository=zookeeper \
  -zookeeper-tag=3.5.5 \
  -zookeeper-replica=3 \
  -greatdb-repository=greatdb/greatdb \
  -greatdb-tag=tag202110 \
  -greatdb-replica=3 \
  -greatdb-serviceHost=172.16.70.249

Output:

liuxinle@liuxinle-OptiPlex-5060:~/k8s/cluster-setup$ ./cluster-setup \
> -clustername=c0 \
> -namespace=test \
> -enable-monitor=true \
> -mysql-image=mysql:5.7 \
> -mysql-replica=3 \
> -mysql-auth=1 \
> -mysql-normal=1 \
> -mysql-global=1 \
> -mysql-partition=1 \
> -zookeeper-repository=zookeeper \
> -zookeeper-tag=3.5.5 \
> -zookeeper-replica=3 \
> -greatdb-repository=greatdb/greatdb \
> -greatdb-tag=tag202110 \
> -greatdb-replica=3 \
> -greatdb-serviceHost=172.16.70.249

INFO[2021-10-14T10:41:52+08:00] SetUp the cluster ...                         NameSpace=test

INFO[2021-10-14T10:41:52+08:00] create namespace ...                         

INFO[2021-10-14T10:41:57+08:00] copy helm chart templates ...                

INFO[2021-10-14T10:41:57+08:00] setup ...                                     Component=MySQL

INFO[2021-10-14T10:41:57+08:00] exec helm install and update greatdb-cfg.yaml ... 

INFO[2021-10-14T10:42:00+08:00] waiting mysql pods running ...               

INFO[2021-10-14T10:44:27+08:00] setup ...                                     Component=Zookeeper

INFO[2021-10-14T10:44:28+08:00] waiting zookeeper pods running ...           

INFO[2021-10-14T10:46:59+08:00] update greatdb-cfg.yaml                      

INFO[2021-10-14T10:46:59+08:00] setup ...                                     Component=greatdb

INFO[2021-10-14T10:47:00+08:00] waiting greatdb pods running ...             

INFO[2021-10-14T10:47:21+08:00] waiting cluster running ...                  

INFO[2021-10-14T10:47:27+08:00] waiting prometheus server running...         

INFO[2021-10-14T10:47:27+08:00] Dump Cluster Info                            

INFO[2021-10-14T10:47:27+08:00] SetUp success.                                ClusterName=c0 NameSpace=test

Run the following command to check the status of the cluster pods:

kubectl get pod -n test -o wide

Output:

NAME                                    READY   STATUS      RESTARTS   AGE     IP             NODE                     NOMINATED NODE   READINESS GATES

c0-auth0-mysql-0                        2/2     Running     0          10m     10.244.87.18   liuxinle-optiplex-5060   <none>           <none>

c0-auth0-mysql-1                        2/2     Running     0          9m23s   10.244.87.54   liuxinle-optiplex-5060   <none>           <none>

c0-auth0-mysql-2                        2/2     Running     0          8m39s   10.244.87.57   liuxinle-optiplex-5060   <none>           <none>

c0-greatdb-0                            2/2     Running     1          5m3s    10.244.87.58   liuxinle-optiplex-5060   <none>           <none>

c0-greatdb-1                            2/2     Running     0          4m57s   10.244.87.20   liuxinle-optiplex-5060   <none>           <none>

c0-greatdb-2                            2/2     Running     0          4m50s   10.244.87.47   liuxinle-optiplex-5060   <none>           <none>

c0-glob0-mysql-0                        2/2     Running     0          10m     10.244.87.51   liuxinle-optiplex-5060   <none>           <none>

c0-glob0-mysql-1                        2/2     Running     0          9m23s   10.244.87.41   liuxinle-optiplex-5060   <none>           <none>

c0-glob0-mysql-2                        2/2     Running     0          8m38s   10.244.87.60   liuxinle-optiplex-5060   <none>           <none>

c0-nor0-mysql-0                         2/2     Running     0          10m     10.244.87.29   liuxinle-optiplex-5060   <none>           <none>

c0-nor0-mysql-1                         2/2     Running     0          9m29s   10.244.87.4    liuxinle-optiplex-5060   <none>           <none>

c0-nor0-mysql-2                         2/2     Running     0          8m45s   10.244.87.25   liuxinle-optiplex-5060   <none>           <none>

c0-par0-mysql-0                         2/2     Running     0          10m     10.244.87.55   liuxinle-optiplex-5060   <none>           <none>

c0-par0-mysql-1                         2/2     Running     0          9m26s   10.244.87.13   liuxinle-optiplex-5060   <none>           <none>

c0-par0-mysql-2                         2/2     Running     0          8m42s   10.244.87.21   liuxinle-optiplex-5060   <none>           <none>

c0-prometheus-server-6697649b76-fkvh9   2/2     Running     0          4m36s   10.244.87.37   liuxinle-optiplex-5060   <none>           <none>

c0-zookeeper-0                          1/1     Running     1          7m35s   10.244.87.44   liuxinle-optiplex-5060   <none>           <none>

c0-zookeeper-1                          1/1     Running     0          6m41s   10.244.87.30   liuxinle-optiplex-5060   <none>           <none>

c0-zookeeper-2                          1/1     Running     0          6m10s   10.244.87.49   liuxinle-optiplex-5060   <none>           <none>

c0-zookeeper-initzk-7hbfs               0/1     Completed   0          7m35s   10.244.87.17   liuxinle-optiplex-5060   <none>           <none>

The pod c0-zookeeper-initzk-7hbfs is in the Completed state and all other pods are Running, which means the cluster was set up successfully.
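As a sanity check, the number of pods can be compared with what the cluster-setup flags imply. The arithmetic below is inferred from the pod names in the listing (one group each of auth, glob, nor and par with 3 MySQL replicas, plus ZooKeeper, GreatDB, one initzk job pod, and one Prometheus server when monitoring is enabled); it is an assumption about how cluster-setup behaves, not documented behavior, and both function names are our own.

```shell
#!/bin/sh
# Hypothetical helper: expected pod count for a given set of cluster-setup flags.
# Args: mysql-replica auth normal global partition zookeeper-replica greatdb-replica enable-monitor
expected_pods() {
  mysql_replica=$1 auth=$2 normal=$3 global=$4 partition=$5
  zk=$6 greatdb=$7 monitor=$8
  # Each MySQL group (auth/normal/global/partition) runs mysql_replica pods.
  mysql=$(( (auth + normal + global + partition) * mysql_replica ))
  extra=1                              # the one-off initzk job pod
  if [ "$monitor" = "true" ]; then
    extra=$(( extra + 1 ))             # the prometheus server pod
  fi
  echo $(( mysql + zk + greatdb + extra ))
}

# Count pod rows (everything after the header) in `kubectl get pod` output.
count_pod_rows() {
  awk 'NR > 1 && NF > 0 { n++ } END { print n + 0 }'
}
```

With the flags used above, `expected_pods 3 1 1 1 1 3 3 true` gives 20, which matches the 20 rows in the listing; `kubectl get pod -n test | count_pod_rows` gives the live count to compare against.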

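The fault types Chaos Mesh can inject in a Kubernetes environment include simulated Pod faults, simulated network faults, simulated stress scenarios, and more. Here we use pod-kill, one of the simulated Pod faults, as the example.

Write the experiment configuration to the file pod-kill.yaml. A sample looks like this: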

apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos   # the kind of fault to inject
metadata:
  name: pod-failure-example
  namespace: test   # namespace where the test cluster's pods run
spec:
  action: pod-kill   # the specific fault action to inject
  mode: all    # how the experiment selects targets; all means every matching Pod
  duration: '30s'    # how long the experiment lasts
  selector:
    labelSelectors:
      "app.kubernetes.io/component": "greatdb"    # label of the pods to target, taken from the Labels field in the output of kubectl describe pod c0-greatdb-1 -n test

Create the fault experiment with the following command:

kubectl create -n test -f pod-kill.yaml
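To target a different component or kill only one pod at a time, the same manifest can be generated from a few parameters. This is a minimal sketch that reuses only the fields shown in pod-kill.yaml; the function name `render_pod_kill` is hypothetical, and any component label value other than greatdb is an assumption that should be confirmed with kubectl describe pod as described above.

```shell
#!/bin/sh
# Hypothetical helper: print a PodChaos pod-kill manifest.
# Args: experiment-name namespace component-label [mode]
# mode defaults to "all"; "one" picks a single matching pod.
render_pod_kill() {
  name=$1 namespace=$2 component=$3 mode=${4:-all}
  cat <<EOF
apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: ${name}
  namespace: ${namespace}
spec:
  action: pod-kill
  mode: ${mode}
  duration: '30s'
  selector:
    labelSelectors:
      "app.kubernetes.io/component": "${component}"
EOF
}
```

For example, `render_pod_kill pod-kill-one test greatdb one > pod-kill-one.yaml` writes a variant that kills a single GreatDB pod, which can then be created with the same kubectl create command.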

After the fault experiment is created, running kubectl get pod -n test -o wide gives the following result:

NAME                                    READY   STATUS              RESTARTS   AGE     IP             NODE                     NOMINATED NODE   READINESS GATES

c0-auth0-mysql-0                        2/2     Running             0          14m     10.244.87.18   liuxinle-optiplex-5060   <none>           <none>

c0-auth0-mysql-1                        2/2     Running             0          14m     10.244.87.54   liuxinle-optiplex-5060   <none>           <none>

c0-auth0-mysql-2                        2/2     Running             0          13m     10.244.87.57   liuxinle-optiplex-5060   <none>           <none>

c0-greatdb-0                            0/2     ContainerCreating   0          2s      <none>         liuxinle-optiplex-5060   <none>           <none>

c0-greatdb-1                            0/2     ContainerCreating   0          2s      <none>         liuxinle-optiplex-5060   <none>           <none>

c0-glob0-mysql-0                        2/2     Running             0          14m     10.244.87.51   liuxinle-optiplex-5060   <none>           <none>

c0-glob0-mysql-1                        2/2     Running             0          14m     10.244.87.41   liuxinle-optiplex-5060   <none>           <none>

c0-glob0-mysql-2                        2/2     Running             0          13m     10.244.87.60   liuxinle-optiplex-5060   <none>           <none>

c0-nor0-mysql-0                         2/2     Running             0          14m     10.244.87.29   liuxinle-optiplex-5060   <none>           <none>

c0-nor0-mysql-1                         2/2     Running             0          14m     10.244.87.4    liuxinle-optiplex-5060   <none>           <none>

c0-nor0-mysql-2                         2/2     Running             0          13m     10.244.87.25   liuxinle-optiplex-5060   <none>           <none>

c0-par0-mysql-0                         2/2     Running             0          14m     10.244.87.55   liuxinle-optiplex-5060   <none>           <none>

c0-par0-mysql-1                         2/2     Running             0          14m     10.244.87.13   liuxinle-optiplex-5060   <none>           <none>

c0-par0-mysql-2                         2/2     Running             0          13m     10.244.87.21   liuxinle-optiplex-5060   <none>           <none>

c0-prometheus-server-6697649b76-fkvh9   2/2     Running             0          9m24s   10.244.87.37   liuxinle-optiplex-5060   <none>           <none>

c0-zookeeper-0                          1/1     Running             1          12m     10.244.87.44   liuxinle-optiplex-5060   <none>           <none>

c0-zookeeper-1                          1/1     Running             0          11m     10.244.87.30   liuxinle-optiplex-5060   <none>           <none>

c0-zookeeper-2                          1/1     Running             0          10m     10.244.87.49   liuxinle-optiplex-5060   <none>           <none>

c0-zookeeper-initzk-7hbfs               0/1     Completed           0          12m     10.244.87.17   liuxinle-optiplex-5060   <none>           <none>

The pods with greatdb in their names are being restarted, which shows that the fault was injected successfully.
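With this many pods, it can be easier to read the listing per component group than per pod. The sketch below groups pods by name with the trailing ordinal removed and prints how many in each group are Running. It assumes the default `kubectl get pod` column order; the function name `summarize_pod_groups` is our own.

```shell
#!/bin/sh
# Hypothetical helper: summarize `kubectl get pod` output per pod group.
# A group is the pod name without its trailing "-<ordinal>", so
# c0-greatdb-0 and c0-greatdb-1 both fall into c0-greatdb.
summarize_pod_groups() {
  awk '
    NR == 1 && $1 == "NAME" { next }   # skip the header row
    NF == 0 { next }                   # ignore blank lines
    {
      group = $1
      sub(/-[0-9]+$/, "", group)       # strip the StatefulSet-style ordinal
      total[group]++
      if ($3 == "Running") running[group]++
    }
    END {
      for (g in total) printf "%s %d/%d\n", g, running[g], total[g]
    }
  ' | sort
}
```

Run against the listing above, `kubectl get pod -n test | summarize_pod_groups` would report `c0-greatdb 0/2` while the MySQL groups stay at 3/3, which makes the injected fault easy to spot.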

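Argo is an open-source, container-native workflow engine for running jobs on Kubernetes. It models a multi-step workflow as a sequence of tasks, which lets us orchestrate the test process.

We use argo to define a test task. The basic test flow is fixed, as follows:

Step 1 of the test flow deploys the test cluster. Two parallel tasks then start: step 2 runs the test case to simulate a business workload, while step 3 uses Chaos Mesh to inject faults at the same time. Once the test case in step 2 finishes, step 4 stops the fault injection, and finally step 5 cleans up the cluster environment.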
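The control flow of these five steps can be sketched in plain shell before writing it as an argo workflow. This is only an illustration of the ordering and the parallelism: every step function below is a placeholder that writes to a log file, and none of the names come from argo, Chaos Mesh, or the GreatDB tooling.

```shell
#!/bin/sh
# Placeholder steps. A real run would call cluster-setup, the test case
# binary and kubectl here; these only log so the flow can run as is.
LOG=${LOG:-/tmp/chaos-flow.log}
step() { echo "$1" >> "$LOG"; }

setup_cluster()   { step "setup"; }
run_testcase()    { step "test-start"; sleep 2; step "test-end"; }
inject_faults()   { while :; do step "fault"; sleep 1; done; }
cleanup_cluster() { step "cleanup"; }

run_flow() {
  : > "$LOG"                          # start with an empty log
  setup_cluster                       # step 1: deploy the test cluster
  run_testcase &                      # step 2: run the test case in the background
  test_pid=$!
  inject_faults &                     # step 3: inject faults in parallel
  chaos_pid=$!
  wait "$test_pid"                    # block until the test case finishes
  kill "$chaos_pid" 2>/dev/null       # step 4: stop the fault injection
  wait "$chaos_pid" 2>/dev/null || true
  cleanup_cluster                     # step 5: clean up the cluster
}
```

Calling `run_flow` and reading the log shows setup first, faults interleaved with the test, and cleanup last.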

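1) Edit the image field in cluster-setup.yaml so that it points to the name and tag of the cluster deployment / cleanup image you uploaded in step 2.2 (preparing the images needed for testing).

2) Edit the image field in testsuite-flashback.yaml so that it points to the name and tag of the test case image you uploaded in step 2.2 (preparing the images needed for testing).

3) Create the resources from the cluster deployment, test case, and tool template yaml files with kubectl apply -n argo -f xxx.yaml (these files define argo templates that are convenient to reference when writing a workflow):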

kubectl apply -n argo -f cluster-setup.yaml

kubectl apply -n argo -f testsuite-flashback.yaml

kubectl apply -n argo -f tools-template.yaml

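4) Make a copy of the workflow template file workflow-template.yaml, change the parts flagged by the comments to your own settings, then run the following command to create the chaos test workflow: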

kubectl apply -n argo -f workflow-template.yaml

Below is a sample workflow template file:

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: chaostest-c0-0-
  name: chaostest-c0-0
  namespace: argo
spec:
  entrypoint: test-entry # test entry point; pass the test parameters here: clustername, namespace, host, greatdb image name and tag, and other basic settings
  serviceAccountName: argo
  arguments:
    parameters:
      - name: clustername
        value: c0
      - name: namespace
        value: test
      - name: host
        value: 172.16.70.249
      - name: port
        value: 30901
      - name: password
        value: Bgview@2020
      - name: user
        value: root
      - name: run-time
        value: 10m
      - name: greatdb-repository
        value: greatdb/greatdb
      - name: greatdb-tag
        value: tag202110
      - name: nemesis
        value: kill_mysql_normal_master,kill_mysql_normal_slave,kill_mysql_partition_master,kill_mysql_partition_slave,kill_mysql_auth_master,kill_mysql_auth_slave,kill_mysql_global_master,kill_mysql_global_slave,kill_mysql_master,kill_mysql_slave,net_partition_mysql_normal,net_partition_mysql_partition,net_partition_mysql_auth,net_partition_mysql_global
      - name: mysql-partition
        value: 1
      - name: mysql-global
        value: 1
      - name: mysql-auth
        value: 1
      - name: mysql-normal
        value: 2
  templates:
    - name: test-entry
      steps:
        - - name: setup-greatdb-cluster  # step.1 deploy the cluster. Specify the correct parameters, mainly the mysql and zookeeper image names and tags
            templateRef:
              name: cluster-setup-template
              template: cluster-setup
            arguments:
              parameters:
                - name: namespace
                  value: "{{workflow.parameters.namespace}}"
                - name: clustername
                  value: "{{workflow.parameters.clustername}}"
                - name: mysql-image
                  value: mysql:5.7.34
                - name: mysql-replica
                  value: 3
                - name: mysql-auth
                  value: "{{workflow.parameters.mysql-auth}}"
                - name: mysql-normal
                  value: "{{workflow.parameters.mysql-normal}}"
                - name: mysql-partition
                  value: "{{workflow.parameters.mysql-partition}}"
                - name: mysql-global
                  value: "{{workflow.parameters.mysql-global}}"
                - name: enable-monitor
                  value: false
                - name: zookeeper-repository
                  value: zookeeper
                - name: zookeeper-tag
                  value: 3.5.5
                - name: zookeeper-replica
                  value: 3
                - name: greatdb-repository
                  value: "{{workflow.parameters.greatdb-repository}}"
                - name: greatdb-tag
                  value: "{{workflow.parameters.greatdb-tag}}"
                - name: greatdb-replica
                  value: 3
                - name: greatdb-serviceHost
                  value: "{{workflow.parameters.host}}"
                - name: greatdb-servicePort
                  value: "{{workflow.parameters.port}}"
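        - - name: run-flashbacktest    # step.2 run the test case. Replace this with the template of the test case you want to run and specify the correct parameters, mainly the number and size of the tables used by the test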
            templateRef:
              name: flashback-test-template
              template: flashback
            arguments:
              parameters:
                - name: user
                  value: "{{workflow.parameters.user}}"
                - name: password
                  value: "{{workflow.parameters.password}}"
                - name: host
                  value: "{{workflow.parameters.host}}"
                - name: port
                  value: "{{workflow.parameters.port}}"
                - name: concurrency
                  value: 16
                - name: size
                  value: 10000
                - name: tables
                  value: 10
                - name: run-time
                  value: "{{workflow.parameters.run-time}}"
                - name: single-statement
                  value: true
                - name: manage-statement
                  value: true
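          - name: invoke-chaos-for-flashabck-test    # step.3 inject faults. Specify the correct parameters; run-time and interval define how long and how often faults are injected, so the separate step that stops fault injection is omitted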
            templateRef:
              name: chaos-rto-template
              template: chaos-rto
            arguments:
              parameters:
                - name: user
                  value: "{{workflow.parameters.user}}"
                - name: host
                  value: "{{workflow.parameters.host}}"
                - name: password
                  value: "{{workflow.parameters.password}}"
                - name: port
                  value: "{{workflow.parameters.port}}"
                - name: k8s-config
                  value: /root/.kube/config
                - name: namespace
                  value: "{{workflow.parameters.namespace}}"
                - name: clustername
                  value: "{{workflow.parameters.clustername}}"
                - name: prometheus
                  value: ''
                - name: greatdb-job
                  value: greatdb-monitor-greatdb
                - name: nemesis
                  value: "{{workflow.parameters.nemesis}}"
                - name: nemesis-duration
                  value: 1m
                - name: nemesis-mode
                  value: default
                - name: wait-time
                  value: 5m
                - name: check-time
                  value: 5m
                - name: nemesis-scope
                  value: 1
                - name: nemesis-log
                  value: true
                - name: enable-monitor
                  value: false
                - name: run-time
                  value: "{{workflow.parameters.run-time}}"
                - name: interval
                  value: 1m
                - name: monitor-log
                  value: false
                - name: enable-rto
                  value: false
                - name: rto-qps
                  value: 0.1
                - name: rto-warm
                  value: 5m
                - name: rto-time
                  value: 1m
                - name: log-level
                  value: debug
        - - name: flashbacktest-output         # output whether the test case passed
            templateRef:
              name: tools-template
              template: output-result
            arguments:
              parameters:
                - name: info
                  value: "flashback test pass, with nemesis: {{workflow.parameters.nemesis}}"
        - - name: clean-greatdb-cluster           # step.4 clean up the test cluster; the parameters here match those of step.1
            templateRef:
              name: cluster-setup-template
              template: cluster-setup
            arguments:
              parameters:
                - name: namespace
                  value: "{{workflow.parameters.namespace}}"
                - name: clustername
                  value: "{{workflow.parameters.clustername}}"
                - name: mysql-image
                  value: mysql:5.7
                - name: mysql-replica
                  value: 3
                - name: mysql-auth
                  value: "{{workflow.parameters.mysql-auth}}"
                - name: mysql-normal
                  value: "{{workflow.parameters.mysql-normal}}"
                - name: mysql-partition
                  value: "{{workflow.parameters.mysql-partition}}"
                - name: mysql-global
                  value: "{{workflow.parameters.mysql-global}}"
                - name: enable-monitor
                  value: false
                - name: zookeeper-repository
                  value: zookeeper
                - name: zookeeper-tag
                  value: 3.5.5
                - name: zookeeper-replica
                  value: 3
                - name: greatdb-repository
                  value: "{{workflow.parameters.greatdb-repository}}"
                - name: greatdb-tag
                  value: "{{workflow.parameters.greatdb-tag}}"
                - name: greatdb-replica
                  value: 3
                - name: greatdb-serviceHost
                  value: "{{workflow.parameters.host}}"
                - name: greatdb-servicePort
                  value: "{{workflow.parameters.port}}"
                - name: clean
                  value: true
        - - name: echo-result
            templateRef:
              name: tools-template
              template: echo
            arguments:
              parameters:
                - name: info
                  value: "{{item}}"
            withItems:
              - "{{steps.flashbacktest-output.outputs.parameters.result}}"
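The nemesis parameter in the template is a long comma-separated list that follows a regular naming pattern. A small helper can build a subset of it for chosen MySQL group types. This is a sketch under assumptions: the names are copied from the pattern in the template above, and we assume (without confirmation from the source) that the chaos-rto template accepts any subset of them in any order. The function name `build_nemesis` is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: build a nemesis list for the given MySQL group types
# (for example: normal partition auth global), using the naming pattern
# kill_mysql_<type>_master, kill_mysql_<type>_slave, net_partition_mysql_<type>.
build_nemesis() {
  list=""
  for shard in "$@"; do
    for item in "kill_mysql_${shard}_master" \
                "kill_mysql_${shard}_slave" \
                "net_partition_mysql_${shard}"; do
      list="${list:+$list,}$item"     # add a comma only between items
    done
  done
  printf '%s\n' "$list"
}
```

For example, `build_nemesis normal` prints `kill_mysql_normal_master,kill_mysql_normal_slave,net_partition_mysql_normal`, which can be pasted into the nemesis value to limit a run to faults on the normal group.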

At this point, you have successfully run a chaos test with Chaos Mesh and verified the stability of the distributed system.

Now enjoy GreatSQL, and enjoy Chaos Mesh :)


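1. Background and introduction to the distributed deployment mode of Wanli Secure Database Software GreatDB

1.1 Background

1.2 Introduction to the distributed deployment mode of Wanli Secure Database Software GreatDB

2. Environment preparation

2.1 Installing Chaos Mesh

2.2 Preparing the images needed for testing

2.2.1 Preparing the MySQL image

2.2.2 Preparing the ZooKeeper image

2.2.3 Preparing the GreatDB image

2.2.4 Preparing the image for deploying / cleaning up the GreatDB distributed cluster

2.2.5 Preparing the test case image

2.3 Uploading the prepared images to a private registry

3. Using Chaos Mesh

3.1 Setting up the GreatDB distributed cluster

3.2 Chaos testing the GreatDB distributed cluster with Chaos Mesh

4. Orchestrating the test flow in argo

4.1 Orchestrating a chaos test workflow with argo (using the flashback test case as the example)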