Chaos Mesh was originally created as a testing platform for the open-source distributed database TiDB, and has since grown into a versatile chaos engineering platform that verifies the stability of distributed systems through chaos testing. Using the distributed deployment mode of the GreatDB database software as an example, this article walks through the full process of chaos testing with Chaos Mesh.

1. Background and an introduction to the GreatDB distributed deployment mode

1.1 Background

Chaos testing is a very good way to probe the uncertainty of a distributed system and build confidence in its resilience, so we adopted the open-source tool Chaos Mesh for chaos testing of GreatDB distributed clusters.

1.2 Introduction to the GreatDB distributed deployment mode

GreatDB is a relational database that supports both centralized and distributed deployment modes; this article covers the distributed deployment mode.

The distributed deployment mode uses a shared-nothing architecture: data redundancy and replica management ensure the database has no single point of failure; data sharding and distributed parallel computation deliver high performance; and data nodes can be scaled out dynamically without limit to meet business needs.

The overall architecture is shown in the figure below:

2. Environment preparation

2.1 Installing Chaos Mesh

Before installing Chaos Mesh, make sure Helm and Docker are already installed and that a Kubernetes environment is ready.
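If you want to double-check the prerequisites first, the following commands (assuming helm, docker, and kubectl are already on your PATH) confirm that Helm, Docker, and a reachable Kubernetes cluster are all in place:

# Confirm the client tools are installed
helm version
docker version

# Confirm kubectl can reach a cluster and the nodes are Ready
kubectl get nodes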

  • Install with Helm

1) Add the Chaos Mesh repository to Helm:

helm repo add chaos-mesh https://charts.chaos-mesh.org

2) Check which Chaos Mesh versions are available for installation:

helm search repo chaos-mesh

3) Create the namespace in which Chaos Mesh will be installed:

kubectl create ns chaos-testing

4) Install Chaos Mesh (for a Docker container runtime):

helm install chaos-mesh chaos-mesh/chaos-mesh -n=chaos-testing
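The command above targets a Docker-based runtime. If your cluster uses containerd instead, the Chaos Mesh Helm chart is typically installed with the runtime and socket path overridden; the flags below follow the values documented for the chart, so adjust them if your chart version differs:

# Install on a containerd-based cluster (adjust socketPath to match your nodes)
helm install chaos-mesh chaos-mesh/chaos-mesh -n=chaos-testing \
  --set chaosDaemon.runtime=containerd \
  --set chaosDaemon.socketPath=/run/containerd/containerd.sock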
  • Verify the installation

Run the following command to check the status of Chaos Mesh:

kubectl get pod -n chaos-testing

The expected output is:

NAME                                       READY   STATUS    RESTARTS   AGE
chaos-controller-manager-d7bc9ccb5-dbccq   1/1     Running   0          26d
chaos-daemon-pzxc7                         1/1     Running   0          26d
chaos-dashboard-5887f7559b-kgz46           1/1     Running   1          26d

If all three pods are in the Running state, Chaos Mesh has been installed successfully.
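Besides checking pod status, you can also open the Chaos Mesh dashboard to manage experiments from a browser. Assuming the chart created a service named chaos-dashboard listening on port 2333 (the default in recent chart versions), a port-forward is enough:

# Forward the dashboard service to localhost:2333, then open http://localhost:2333
kubectl port-forward -n chaos-testing svc/chaos-dashboard 2333:2333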

2.2 Prepare the images needed for testing

2.2.1 Prepare the MySQL image

In general, MySQL uses the official 5.7 image, and mysqld-exporter serves as the MySQL monitoring collector; both can be pulled directly from Docker Hub:

docker pull mysql:5.7
docker pull prom/mysqld-exporter

2.2.2 Prepare the ZooKeeper image

ZooKeeper uses the official 3.5.5 image; monitoring for the ZooKeeper component involves jmx-prometheus-exporter and zookeeper-exporter, both pulled from Docker Hub:

docker pull zookeeper:3.5.5
docker pull sscaling/jmx-prometheus-exporter
docker pull josdotso/zookeeper-exporter

2.2.3 Prepare the GreatDB image

Pick a GreatDB tar package and extract it to get a ./greatdb directory, then copy the greatdb-service-docker.sh file into that extracted ./greatdb directory:

cp greatdb-service-docker.sh ./greatdb/

Place the greatdb Dockerfile in the directory that contains the ./greatdb folder (its parent directory), then run the following command to build the GreatDB image:

docker build -t greatdb/greatdb:tag2021 .
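As a quick sanity check after the build, you can confirm the image exists locally. Note that the tag you build here must match the -greatdb-tag value passed to cluster-setup and to the Argo workflow later, so keep them consistent:

# List the locally built GreatDB images and their tags
docker images greatdb/greatdb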

2.2.4 Prepare the images for deploying/cleaning up the GreatDB distributed cluster

Download the cluster deployment script cluster-setup, the cluster initialization script init-zk, and the cluster Helm charts package (available on request from the 4.0 development/testing team).

Put the above materials in the same directory and write the following Dockerfile:

FROM debian:buster-slim as init-zk
COPY ./init-zk /root/init-zk
RUN chmod +x /root/init-zk

FROM debian:buster-slim as cluster-setup
# Set aliyun repo for speed
RUN sed -i 's/deb.debian.org/mirrors.aliyun.com/g' /etc/apt/sources.list && \
  sed -i 's/security.debian.org/mirrors.aliyun.com/g' /etc/apt/sources.list
RUN apt-get -y update && \
  apt-get -y install \
  curl \
  wget
RUN curl -L https://storage.googleapis.com/kubernetes-release/release/v1.20.1/bin/linux/amd64/kubectl -o /usr/local/bin/kubectl && \
  chmod +x /usr/local/bin/kubectl && \
  mkdir /root/.kube && \
  wget https://get.helm.sh/helm-v3.5.3-linux-amd64.tar.gz && \
  tar -zxvf helm-v3.5.3-linux-amd64.tar.gz && \
  mv linux-amd64/helm /usr/local/bin/helm
COPY ./config /root/.kube/
COPY ./helm /helm
COPY ./cluster-setup /

Run the following commands to build the required images:

docker build --target init-zk -t greatdb/initzk:latest .
docker build --target cluster-setup -t greatdb/cluster-setup:v1 .

2.2.5 Prepare the test-case images

The test cases currently supported include bank, bank2, pbank, tpcc, flashback, and others; each test case is a single executable file.

Taking the flashback test case as an example, first download the executable locally, then write a Dockerfile with the following content in the same directory as the test case:

FROM debian:buster-slim
COPY ./flashback /
RUN cd / && chmod +x ./flashback

Run the following command to build the test-case image:

docker build -t greatdb/testsuite-flashback:v1 .

2.3 Push the prepared images to a private registry

For how to create a private registry and push images to it, see: https://zhuanlan.zhihu.com/p/...
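The exact commands depend on your registry, but pushing an image generally amounts to retagging it with the registry address and running docker push. The address registry.example.com:5000 below is only a placeholder for your own private registry:

# Retag the locally built image with the private registry address (placeholder), then push it
docker tag greatdb/greatdb:tag2021 registry.example.com:5000/greatdb/greatdb:tag2021
docker push registry.example.com:5000/greatdb/greatdb:tag2021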

3. Using Chaos Mesh

3.1 Set up the GreatDB distributed cluster

In the cluster-setup directory from section 2.2.4, run the following command to set up the test cluster:

./cluster-setup \
  -clustername=c0 \
  -namespace=test \
  -enable-monitor=true \
  -mysql-image=mysql:5.7 \
  -mysql-replica=3 \
  -mysql-auth=1 \
  -mysql-normal=1 \
  -mysql-global=1 \
  -mysql-partition=1 \
  -zookeeper-repository=zookeeper \
  -zookeeper-tag=3.5.5 \
  -zookeeper-replica=3 \
  -greatdb-repository=greatdb/greatdb \
  -greatdb-tag=tag202110 \
  -greatdb-replica=3 \
  -greatdb-serviceHost=172.16.70.249

Output:

liuxinle@liuxinle-OptiPlex-5060:~/k8s/cluster-setup$ ./cluster-setup \
> -clustername=c0 \
> -namespace=test \
> -enable-monitor=true \
> -mysql-image=mysql:5.7 \
> -mysql-replica=3 \
> -mysql-auth=1 \
> -mysql-normal=1 \
> -mysql-global=1 \
> -mysql-partition=1 \
> -zookeeper-repository=zookeeper \
> -zookeeper-tag=3.5.5 \
> -zookeeper-replica=3 \
> -greatdb-repository=greatdb/greatdb \
> -greatdb-tag=tag202110 \
> -greatdb-replica=3 \
> -greatdb-serviceHost=172.16.70.249
INFO[2021-10-14T10:41:52+08:00] SetUp the cluster ...                         NameSpace=test
INFO[2021-10-14T10:41:52+08:00] create namespace ...
INFO[2021-10-14T10:41:57+08:00] copy helm chart templates ...
INFO[2021-10-14T10:41:57+08:00] setup ...                                     Component=MySQL
INFO[2021-10-14T10:41:57+08:00] exec helm install and update greatdb-cfg.yaml ...
INFO[2021-10-14T10:42:00+08:00] waiting mysql pods running ...
INFO[2021-10-14T10:44:27+08:00] setup ...                                     Component=Zookeeper
INFO[2021-10-14T10:44:28+08:00] waiting zookeeper pods running ...
INFO[2021-10-14T10:46:59+08:00] update greatdb-cfg.yaml
INFO[2021-10-14T10:46:59+08:00] setup ...                                     Component=greatdb
INFO[2021-10-14T10:47:00+08:00] waiting greatdb pods running ...
INFO[2021-10-14T10:47:21+08:00] waiting cluster running ...
INFO[2021-10-14T10:47:27+08:00] waiting prometheus server running...
INFO[2021-10-14T10:47:27+08:00] Dump Cluster Info
INFO[2021-10-14T10:47:27+08:00] SetUp success.                                ClusterName=c0 NameSpace=test

Run the following command to check the pod status of the cluster:

kubectl get pod -n test -o wide

Output:

NAME                                    READY   STATUS      RESTARTS   AGE     IP             NODE                     NOMINATED NODE   READINESS GATES
c0-auth0-mysql-0                        2/2     Running     0          10m     10.244.87.18   liuxinle-optiplex-5060   <none>           <none>
c0-auth0-mysql-1                        2/2     Running     0          9m23s   10.244.87.54   liuxinle-optiplex-5060   <none>           <none>
c0-auth0-mysql-2                        2/2     Running     0          8m39s   10.244.87.57   liuxinle-optiplex-5060   <none>           <none>
c0-greatdb-0                            2/2     Running     1          5m3s    10.244.87.58   liuxinle-optiplex-5060   <none>           <none>
c0-greatdb-1                            2/2     Running     0          4m57s   10.244.87.20   liuxinle-optiplex-5060   <none>           <none>
c0-greatdb-2                            2/2     Running     0          4m50s   10.244.87.47   liuxinle-optiplex-5060   <none>           <none>
c0-glob0-mysql-0                        2/2     Running     0          10m     10.244.87.51   liuxinle-optiplex-5060   <none>           <none>
c0-glob0-mysql-1                        2/2     Running     0          9m23s   10.244.87.41   liuxinle-optiplex-5060   <none>           <none>
c0-glob0-mysql-2                        2/2     Running     0          8m38s   10.244.87.60   liuxinle-optiplex-5060   <none>           <none>
c0-nor0-mysql-0                         2/2     Running     0          10m     10.244.87.29   liuxinle-optiplex-5060   <none>           <none>
c0-nor0-mysql-1                         2/2     Running     0          9m29s   10.244.87.4    liuxinle-optiplex-5060   <none>           <none>
c0-nor0-mysql-2                         2/2     Running     0          8m45s   10.244.87.25   liuxinle-optiplex-5060   <none>           <none>
c0-par0-mysql-0                         2/2     Running     0          10m     10.244.87.55   liuxinle-optiplex-5060   <none>           <none>
c0-par0-mysql-1                         2/2     Running     0          9m26s   10.244.87.13   liuxinle-optiplex-5060   <none>           <none>
c0-par0-mysql-2                         2/2     Running     0          8m42s   10.244.87.21   liuxinle-optiplex-5060   <none>           <none>
c0-prometheus-server-6697649b76-fkvh9   2/2     Running     0          4m36s   10.244.87.37   liuxinle-optiplex-5060   <none>           <none>
c0-zookeeper-0                          1/1     Running     1          7m35s   10.244.87.44   liuxinle-optiplex-5060   <none>           <none>
c0-zookeeper-1                          1/1     Running     0          6m41s   10.244.87.30   liuxinle-optiplex-5060   <none>           <none>
c0-zookeeper-2                          1/1     Running     0          6m10s   10.244.87.49   liuxinle-optiplex-5060   <none>           <none>
c0-zookeeper-initzk-7hbfs               0/1     Completed   0          7m35s   10.244.87.17   liuxinle-optiplex-5060   <none>           <none>

Seeing that c0-zookeeper-initzk-7hbfs is in the Completed state while every other pod is Running indicates that the cluster has been set up successfully.
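At this point you can also connect with an ordinary MySQL client to confirm that the SQL service is reachable. The host and port below are the -greatdb-serviceHost used above and the service port used later in the workflow (172.16.70.249 and 30901); substitute your own values:

# Connect to the GreatDB service with a standard MySQL client (enter the password when prompted)
mysql -h 172.16.70.249 -P 30901 -u root -p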

3.2 Run chaos tests on the GreatDB distributed cluster with Chaos Mesh

In a Kubernetes environment, the fault types Chaos Mesh can inject include simulated pod faults, simulated network faults, simulated stress scenarios, and more. Here we take pod-kill, one of the pod faults, as an example.

Write the experiment configuration to a file named pod-kill.yaml, for example:

apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos   # type of fault to inject
metadata:
  name: pod-failure-example
  namespace: test   # namespace of the test cluster pods
spec:
  action: pod-kill   # the specific fault action to inject
  mode: all    # how targets are selected; all means every pod that matches the selector
  duration: '30s'    # how long the experiment lasts
  selector:
    labelSelectors:
      "app.kubernetes.io/component": "greatdb"    # label of the target pods, taken from the Labels field in the output of kubectl describe pod c0-greatdb-1 -n test

Create the chaos experiment with the following command:

kubectl create -n test -f pod-kill.yaml

After the experiment is created, running kubectl get pod -n test -o wide gives the following result:

NAME                                    READY   STATUS              RESTARTS   AGE     IP             NODE                     NOMINATED NODE   READINESS GATES
c0-auth0-mysql-0                        2/2     Running             0          14m     10.244.87.18   liuxinle-optiplex-5060   <none>           <none>
c0-auth0-mysql-1                        2/2     Running             0          14m     10.244.87.54   liuxinle-optiplex-5060   <none>           <none>
c0-auth0-mysql-2                        2/2     Running             0          13m     10.244.87.57   liuxinle-optiplex-5060   <none>           <none>
c0-greatdb-0                            0/2     ContainerCreating   0          2s      <none>         liuxinle-optiplex-5060   <none>           <none>
c0-greatdb-1                            0/2     ContainerCreating   0          2s      <none>         liuxinle-optiplex-5060   <none>           <none>
c0-glob0-mysql-0                        2/2     Running             0          14m     10.244.87.51   liuxinle-optiplex-5060   <none>           <none>
c0-glob0-mysql-1                        2/2     Running             0          14m     10.244.87.41   liuxinle-optiplex-5060   <none>           <none>
c0-glob0-mysql-2                        2/2     Running             0          13m     10.244.87.60   liuxinle-optiplex-5060   <none>           <none>
c0-nor0-mysql-0                         2/2     Running             0          14m     10.244.87.29   liuxinle-optiplex-5060   <none>           <none>
c0-nor0-mysql-1                         2/2     Running             0          14m     10.244.87.4    liuxinle-optiplex-5060   <none>           <none>
c0-nor0-mysql-2                         2/2     Running             0          13m     10.244.87.25   liuxinle-optiplex-5060   <none>           <none>
c0-par0-mysql-0                         2/2     Running             0          14m     10.244.87.55   liuxinle-optiplex-5060   <none>           <none>
c0-par0-mysql-1                         2/2     Running             0          14m     10.244.87.13   liuxinle-optiplex-5060   <none>           <none>
c0-par0-mysql-2                         2/2     Running             0          13m     10.244.87.21   liuxinle-optiplex-5060   <none>           <none>
c0-prometheus-server-6697649b76-fkvh9   2/2     Running             0          9m24s   10.244.87.37   liuxinle-optiplex-5060   <none>           <none>
c0-zookeeper-0                          1/1     Running             1          12m     10.244.87.44   liuxinle-optiplex-5060   <none>           <none>
c0-zookeeper-1                          1/1     Running             0          11m     10.244.87.30   liuxinle-optiplex-5060   <none>           <none>
c0-zookeeper-2                          1/1     Running             0          10m     10.244.87.49   liuxinle-optiplex-5060   <none>           <none>
c0-zookeeper-initzk-7hbfs               0/1     Completed           0          12m     10.244.87.17   liuxinle-optiplex-5060   <none>           <none>

You can see that the pods whose names contain greatdb are being restarted, which shows the fault was injected successfully.
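The PodChaos object keeps running according to its definition. When you are done observing, you can inspect or remove the experiment with standard kubectl commands (podchaos is the custom resource type created by Chaos Mesh):

# Inspect the chaos experiment object created from pod-kill.yaml
kubectl get podchaos -n test
kubectl describe podchaos pod-failure-example -n test

# Remove the experiment once the observation is finished
kubectl delete -f pod-kill.yaml -n test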

4. Orchestrating the test flow with Argo

Argo is an open-source, container-native workflow engine for getting work done on Kubernetes. It models a multi-step workflow as a series of tasks, which is what we use to orchestrate the test flow.

We use Argo to define a test task; the basic test flow is fixed, as follows:

Step 1 of the test flow deploys the test cluster; then two tasks run in parallel: step 2 runs the test case to simulate the business workload while step 3 uses Chaos Mesh to inject faults; after the test case in step 2 finishes, step 4 stops fault injection; finally, step 5 cleans up the cluster environment.

4.1 Orchestrate a chaos-testing workflow with Argo (using the flashback test case as an example)

1) Modify the image information in cluster-setup.yaml to the name and tag of the cluster deployment/cleanup image you pushed in section 2.2 (Prepare the images needed for testing).

2) Modify the image information in testsuite-flashback.yaml to the name and tag of the test-case image you pushed in section 2.2.

3) Create resources from the cluster-deployment, test-case, and tools-template YAML files with kubectl apply -n argo -f xxx.yaml (these files define Argo templates that users can reference when writing workflows):

kubectl apply -n argo -f cluster-setup.yaml
kubectl apply -n argo -f testsuite-flashback.yaml
kubectl apply -n argo -f tools-template.yaml

4) Make a copy of the workflow template file workflow-template.yaml, change the parts flagged by comments in the template to your own settings, and then run the following command to create the chaos-testing workflow:

kubectl apply -n argo -f workflow-template.yaml

Here is a workflow template file:

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: chaostest-c0-0-
  name: chaostest-c0-0
  namespace: argo
spec:
  entrypoint: test-entry # test entry; pass in the basic test parameters here: clustername, namespace, host, GreatDB image name and tag, etc.
  serviceAccountName: argo
  arguments:
    parameters:
      - name: clustername
        value: c0
      - name: namespace
        value: test
      - name: host
        value: 172.16.70.249
      - name: port
        value: 30901
      - name: password
        value: Bgview@2020
      - name: user
        value: root
      - name: run-time
        value: 10m
      - name: greatdb-repository
        value: greatdb/greatdb
      - name: greatdb-tag
        value: tag202110
      - name: nemesis
        value: kill_mysql_normal_master,kill_mysql_normal_slave,kill_mysql_partition_master,kill_mysql_partition_slave,kill_mysql_auth_master,kill_mysql_auth_slave,kill_mysql_global_master,kill_mysql_global_slave,kill_mysql_master,kill_mysql_slave,net_partition_mysql_normal,net_partition_mysql_partition,net_partition_mysql_auth,net_partition_mysql_global
      - name: mysql-partition
        value: 1
      - name: mysql-global
        value: 1
      - name: mysql-auth
        value: 1
      - name: mysql-normal
        value: 2
  templates:
    - name: test-entry
      steps:
        - - name: setup-greatdb-cluster  # step.1 deploy the cluster; specify the correct parameters, mainly the MySQL and ZooKeeper image names and tags
            templateRef:
              name: cluster-setup-template
              template: cluster-setup
            arguments:
              parameters:
                - name: namespace
                  value: "{{workflow.parameters.namespace}}"
                - name: clustername
                  value: "{{workflow.parameters.clustername}}"
                - name: mysql-image
                  value: mysql:5.7.34
                - name: mysql-replica
                  value: 3
                - name: mysql-auth
                  value: "{{workflow.parameters.mysql-auth}}"
                - name: mysql-normal
                  value: "{{workflow.parameters.mysql-normal}}"
                - name: mysql-partition
                  value: "{{workflow.parameters.mysql-partition}}"
                - name: mysql-global
                  value: "{{workflow.parameters.mysql-global}}"
                - name: enable-monitor
                  value: false
                - name: zookeeper-repository
                  value: zookeeper
                - name: zookeeper-tag
                  value: 3.5.5
                - name: zookeeper-replica
                  value: 3
                - name: greatdb-repository
                  value: "{{workflow.parameters.greatdb-repository}}"
                - name: greatdb-tag
                  value: "{{workflow.parameters.greatdb-tag}}"
                - name: greatdb-replica
                  value: 3
                - name: greatdb-serviceHost
                  value: "{{workflow.parameters.host}}"
                - name: greatdb-servicePort
                  value: "{{workflow.parameters.port}}"
        - - name: run-flashbacktest    # step.2 run the test case; replace with the template of the test case you want to run and specify the correct parameters, mainly the number and size of the tables used by the test
            templateRef:
              name: flashback-test-template
              template: flashback
            arguments:
              parameters:
                - name: user
                  value: "{{workflow.parameters.user}}"
                - name: password
                  value: "{{workflow.parameters.password}}"
                - name: host
                  value: "{{workflow.parameters.host}}"
                - name: port
                  value: "{{workflow.parameters.port}}"
                - name: concurrency
                  value: 16
                - name: size
                  value: 10000
                - name: tables
                  value: 10
                - name: run-time
                  value: "{{workflow.parameters.run-time}}"
                - name: single-statement
                  value: true
                - name: manage-statement
                  value: true
          - name: invoke-chaos-for-flashabck-test    # step.3 inject faults in parallel; specify the correct parameters. run-time and interval define how long and how often faults are injected, so a separate "stop fault injection" step is omitted
            templateRef:
              name: chaos-rto-template
              template: chaos-rto
            arguments:
              parameters:
                - name: user
                  value: "{{workflow.parameters.user}}"
                - name: host
                  value: "{{workflow.parameters.host}}"
                - name: password
                  value: "{{workflow.parameters.password}}"
                - name: port
                  value: "{{workflow.parameters.port}}"
                - name: k8s-config
                  value: /root/.kube/config
                - name: namespace
                  value: "{{workflow.parameters.namespace}}"
                - name: clustername
                  value: "{{workflow.parameters.clustername}}"
                - name: prometheus
                  value: ''
                - name: greatdb-job
                  value: greatdb-monitor-greatdb
                - name: nemesis
                  value: "{{workflow.parameters.nemesis}}"
                - name: nemesis-duration
                  value: 1m
                - name: nemesis-mode
                  value: default
                - name: wait-time
                  value: 5m
                - name: check-time
                  value: 5m
                - name: nemesis-scope
                  value: 1
                - name: nemesis-log
                  value: true
                - name: enable-monitor
                  value: false
                - name: run-time
                  value: "{{workflow.parameters.run-time}}"
                - name: interval
                  value: 1m
                - name: monitor-log
                  value: false
                - name: enable-rto
                  value: false
                - name: rto-qps
                  value: 0.1
                - name: rto-warm
                  value: 5m
                - name: rto-time
                  value: 1m
                - name: log-level
                  value: debug
        - - name: flashbacktest-output         # output whether the test case passed
            templateRef:
              name: tools-template
              template: output-result
            arguments:
              parameters:
                - name: info
                  value: "flashback test pass, with nemesis: {{workflow.parameters.nemesis}}"
        - - name: clean-greatdb-cluster           # step.4 clean up the test cluster; the parameters here match those of step.1
            templateRef:
              name: cluster-setup-template
              template: cluster-setup
            arguments:
              parameters:
                - name: namespace
                  value: "{{workflow.parameters.namespace}}"
                - name: clustername
                  value: "{{workflow.parameters.clustername}}"
                - name: mysql-image
                  value: mysql:5.7
                - name: mysql-replica
                  value: 3
                - name: mysql-auth
                  value: "{{workflow.parameters.mysql-auth}}"
                - name: mysql-normal
                  value: "{{workflow.parameters.mysql-normal}}"
                - name: mysql-partition
                  value: "{{workflow.parameters.mysql-partition}}"
                - name: mysql-global
                  value: "{{workflow.parameters.mysql-global}}"
                - name: enable-monitor
                  value: false
                - name: zookeeper-repository
                  value: zookeeper
                - name: zookeeper-tag
                  value: 3.5.5
                - name: zookeeper-replica
                  value: 3
                - name: greatdb-repository
                  value: "{{workflow.parameters.greatdb-repository}}"
                - name: greatdb-tag
                  value: "{{workflow.parameters.greatdb-tag}}"
                - name: greatdb-replica
                  value: 3
                - name: greatdb-serviceHost
                  value: "{{workflow.parameters.host}}"
                - name: greatdb-servicePort
                  value: "{{workflow.parameters.port}}"
                - name: clean
                  value: true
        - - name: echo-result
            templateRef:
              name: tools-template
              template: echo
            arguments:
              parameters:
                - name: info
                  value: "{{item}}"
            withItems:
              - "{{steps.flashbacktest-output.outputs.parameters.result}}"

At this point, you have successfully run a chaos test with Chaos Mesh and verified the stability of the distributed system.

Now enjoy GreatSQL, and enjoy Chaos Mesh :)