Chaos Mesh in Practice: Verifying the Stability of GreatDB's Distributed Deployment Mode Through Chaos Engineering

Author: TiDB_Robot, Database R&D Engineer, PingCAP · 2022-03-31

Chaos Mesh was originally created as a testing platform for TiDB, the open-source distributed database, and has since grown into a versatile chaos engineering platform for verifying the stability of distributed systems through chaos testing. Using the distributed deployment mode of GreatDB, Wanli's secure database software, as an example, this article walks through the full process of chaos testing with Chaos Mesh.

Background and GreatDB Overview

Background

Chaos testing is an excellent way to surface the uncertainties in a distributed system and build confidence in its resilience, so we adopted the open-source tool Chaos Mesh for chaos testing of GreatDB distributed clusters.

GreatDB's Distributed Deployment Mode

GreatDB, Wanli's secure database software, is a relational database that supports both centralized and distributed deployment; this article covers the distributed deployment mode.

The distributed deployment mode uses a shared-nothing architecture. Data redundancy and replica management eliminate single points of failure in the database; data sharding and distributed parallel computing deliver high performance; and data nodes can be scaled out dynamically without limit to meet business needs.

The overall architecture is shown in the figure below:

[Figure 1: overall architecture of the GreatDB distributed deployment]

Environment Preparation

Installing Chaos Mesh

Before installing Chaos Mesh, make sure helm and docker are already installed and that a kubernetes environment is available.
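
If you do not yet have a cluster at hand, a disposable single-node environment is enough for following along. As a sketch only, assuming the kind tool is installed (any Kubernetes distribution works just as well):

kind create cluster --name chaos-test

kubectl cluster-info --context kind-chaos-test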

  • Install with Helm

1) Add the Chaos Mesh repository to Helm:


helm repo add chaos-mesh https://charts.chaos-mesh.org

2) Check the Chaos Mesh versions available for installation:


helm search repo chaos-mesh

3) Create the namespace in which Chaos Mesh will be installed:


kubectl create ns chaos-testing

4) Install Chaos Mesh in a docker environment:


helm install chaos-mesh chaos-mesh/chaos-mesh -n=chaos-testing
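
The command above targets a cluster whose container runtime is docker. If your cluster runs containerd instead (as kind and K3s environments do), the Chaos Mesh chart documents values for pointing chaos-daemon at the containerd socket; a sketch:

helm install chaos-mesh chaos-mesh/chaos-mesh -n=chaos-testing --set chaosDaemon.runtime=containerd --set chaosDaemon.socketPath=/run/containerd/containerd.sock
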
  • Verify the installation

Run the following command to check that Chaos Mesh is running:

kubectl get pod -n chaos-testing

The expected output is:


NAME                                       READY   STATUS    RESTARTS   AGE
chaos-controller-manager-d7bc9ccb5-dbccq   1/1     Running   0          26d
chaos-daemon-pzxc7                         1/1     Running   0          26d
chaos-dashboard-5887f7559b-kgz46           1/1     Running   1          26d

If all three pods are in the Running state, Chaos Mesh has been installed successfully.
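
Optionally, the Chaos Mesh dashboard can be opened in a browser to create and observe experiments visually. A sketch, assuming the chart's default service name and port:

kubectl port-forward -n chaos-testing svc/chaos-dashboard 2333:2333

Then visit http://localhost:2333.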

Preparing the Images Needed for Testing

Prepare the MySQL image

MySQL normally uses the official 5.7 image, and mysqld-exporter serves as the MySQL metrics collector; both can be pulled directly from docker hub:


docker pull mysql:5.7

docker pull prom/mysqld-exporter

Prepare the ZooKeeper image

ZooKeeper uses the official 3.5.5 image. Monitoring for the ZooKeeper component involves jmx-prometheus-exporter and zookeeper-exporter, both pulled from docker hub:


docker pull zookeeper:3.5.5

docker pull sscaling/jmx-prometheus-exporter

docker pull josdotso/zookeeper-exporter

Prepare the GreatDB image

Pick a GreatDB tar package and extract it to obtain a ./greatdb directory, then copy the greatdb-service-docker.sh file into that directory:


cp greatdb-service-docker.sh ./greatdb/

Place the greatdb Dockerfile in the directory that contains the ./greatdb folder, then build the GreatDB image:


docker build -t greatdb/greatdb:tag2021 .
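
The greatdb Dockerfile itself is not reproduced in this article. Purely as an illustration of the layout described above, a minimal Dockerfile might look like the sketch below; the base image, paths, and entrypoint are assumptions, not GreatDB's actual packaging:

FROM debian:buster-slim
# Copy the extracted GreatDB directory (including greatdb-service-docker.sh) into the image
COPY ./greatdb /greatdb
RUN chmod +x /greatdb/greatdb-service-docker.sh
# Assumed entrypoint: the service script shipped alongside the binaries
ENTRYPOINT ["/greatdb/greatdb-service-docker.sh"]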

Prepare the image for GreatDB cluster deployment/cleanup

Download the cluster deployment script cluster-setup, the cluster initialization script init-zk, and the cluster helm charts package (available from the 4.0 dev/test team).

Place these materials in the same directory and write the following Dockerfile:


FROM debian:buster-slim as init-zk
COPY ./init-zk /root/init-zk
RUN chmod +x /root/init-zk

FROM debian:buster-slim as cluster-setup
# Set aliyun repo for speed
RUN sed -i 's/deb.debian.org/mirrors.aliyun.com/g' /etc/apt/sources.list && \
    sed -i 's/security.debian.org/mirrors.aliyun.com/g' /etc/apt/sources.list
RUN apt-get -y update && \
    apt-get -y install \
    curl \
    wget
RUN curl -L https://storage.googleapis.com/kubernetes-release/release/v1.20.1/bin/linux/amd64/kubectl -o /usr/local/bin/kubectl && \
    chmod +x /usr/local/bin/kubectl && \
    mkdir /root/.kube && \
    wget https://get.helm.sh/helm-v3.5.3-linux-amd64.tar.gz && \
    tar -zxvf helm-v3.5.3-linux-amd64.tar.gz && \
    mv linux-amd64/helm /usr/local/bin/helm
COPY ./config /root/.kube/
COPY ./helm /helm
COPY ./cluster-setup /

Build the required images:


docker build --target init-zk -t greatdb/initzk:latest .
docker build --target cluster-setup -t greatdb/cluster-setup:v1 .

Prepare the test case images

The test cases currently supported include bank, bank2, pbank, tpcc, flashback, and others; each case is a single executable.

Taking the flashback case as an example, download the case locally and, in the same directory, write a Dockerfile with the following content:


FROM debian:buster-slim
COPY ./flashback /
RUN cd / && chmod +x ./flashback

Build the test case image:


docker build -t greatdb/testsuite-flashback:v1 .

Push the prepared images to a private registry

Refer to the documentation on creating a private registry and pushing images.
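
As a quick sketch of the push step itself: images are retagged with the registry address and then pushed. The address 172.16.70.249:5000 below is a placeholder for your own registry:

docker tag greatdb/greatdb:tag2021 172.16.70.249:5000/greatdb/greatdb:tag2021

docker push 172.16.70.249:5000/greatdb/greatdb:tag2021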

Using Chaos Mesh

Set up the GreatDB distributed cluster

In the cluster-setup directory from the previous chapter, run the following command to set up the test cluster:


./cluster-setup \
  -clustername=c0 \
  -namespace=test \
  -enable-monitor=true \
  -mysql-image=mysql:5.7 \
  -mysql-replica=3 \
  -mysql-auth=1 \
  -mysql-normal=1 \
  -mysql-global=1 \
  -mysql-partition=1 \
  -zookeeper-repository=zookeeper \
  -zookeeper-tag=3.5.5 \
  -zookeeper-replica=3 \
  -greatdb-repository=greatdb/greatdb \
  -greatdb-tag=tag202110 \
  -greatdb-replica=3 \
  -greatdb-serviceHost=172.16.70.249

Output:


liuxinle@liuxinle-OptiPlex-5060:~/k8s/cluster-setup$ ./cluster-setup \
> -clustername=c0 \
> -namespace=test \
> -enable-monitor=true \
> -mysql-image=mysql:5.7 \
> -mysql-replica=3 \
> -mysql-auth=1 \
> -mysql-normal=1 \
> -mysql-global=1 \
> -mysql-partition=1 \
> -zookeeper-repository=zookeeper \
> -zookeeper-tag=3.5.5 \
> -zookeeper-replica=3 \
> -greatdb-repository=greatdb/greatdb \
> -greatdb-tag=tag202110 \
> -greatdb-replica=3 \
> -greatdb-serviceHost=172.16.70.249
INFO[2021-10-14T10:41:52+08:00] SetUp the cluster ... NameSpace=test
INFO[2021-10-14T10:41:52+08:00] create namespace ...
INFO[2021-10-14T10:41:57+08:00] copy helm chart templates ...
INFO[2021-10-14T10:41:57+08:00] setup ... Component=MySQL
INFO[2021-10-14T10:41:57+08:00] exec helm install and update greatdb-cfg.yaml ...
INFO[2021-10-14T10:42:00+08:00] waiting mysql pods running ...
INFO[2021-10-14T10:44:27+08:00] setup ... Component=Zookeeper
INFO[2021-10-14T10:44:28+08:00] waiting zookeeper pods running ...
INFO[2021-10-14T10:46:59+08:00] update greatdb-cfg.yaml
INFO[2021-10-14T10:46:59+08:00] setup ... Component=greatdb
INFO[2021-10-14T10:47:00+08:00] waiting greatdb pods running ...
INFO[2021-10-14T10:47:21+08:00] waiting cluster running ...
INFO[2021-10-14T10:47:27+08:00] waiting prometheus server running...
INFO[2021-10-14T10:47:27+08:00] Dump Cluster Info
INFO[2021-10-14T10:47:27+08:00] SetUp success. ClusterName=c0 NameSpace=test

Run the following command to check the cluster's pod status:


kubectl get pod -n test -o wide

Output:


NAME                                    READY   STATUS      RESTARTS   AGE     IP             NODE                     NOMINATED NODE   READINESS GATES
c0-auth0-mysql-0                        2/2     Running     0          10m     10.244.87.18   liuxinle-optiplex-5060
c0-auth0-mysql-1                        2/2     Running     0          9m23s   10.244.87.54   liuxinle-optiplex-5060
c0-auth0-mysql-2                        2/2     Running     0          8m39s   10.244.87.57   liuxinle-optiplex-5060
c0-greatdb-0                            2/2     Running     1          5m3s    10.244.87.58   liuxinle-optiplex-5060
c0-greatdb-1                            2/2     Running     0          4m57s   10.244.87.20   liuxinle-optiplex-5060
c0-greatdb-2                            2/2     Running     0          4m50s   10.244.87.47   liuxinle-optiplex-5060
c0-glob0-mysql-0                        2/2     Running     0          10m     10.244.87.51   liuxinle-optiplex-5060
c0-glob0-mysql-1                        2/2     Running     0          9m23s   10.244.87.41   liuxinle-optiplex-5060
c0-glob0-mysql-2                        2/2     Running     0          8m38s   10.244.87.60   liuxinle-optiplex-5060
c0-nor0-mysql-0                         2/2     Running     0          10m     10.244.87.29   liuxinle-optiplex-5060
c0-nor0-mysql-1                         2/2     Running     0          9m29s   10.244.87.4    liuxinle-optiplex-5060
c0-nor0-mysql-2                         2/2     Running     0          8m45s   10.244.87.25   liuxinle-optiplex-5060
c0-par0-mysql-0                         2/2     Running     0          10m     10.244.87.55   liuxinle-optiplex-5060
c0-par0-mysql-1                         2/2     Running     0          9m26s   10.244.87.13   liuxinle-optiplex-5060
c0-par0-mysql-2                         2/2     Running     0          8m42s   10.244.87.21   liuxinle-optiplex-5060
c0-prometheus-server-6697649b76-fkvh9   2/2     Running     0          4m36s   10.244.87.37   liuxinle-optiplex-5060
c0-zookeeper-0                          1/1     Running     1          7m35s   10.244.87.44   liuxinle-optiplex-5060
c0-zookeeper-1                          1/1     Running     0          6m41s   10.244.87.30   liuxinle-optiplex-5060
c0-zookeeper-2                          1/1     Running     0          6m10s   10.244.87.49   liuxinle-optiplex-5060
c0-zookeeper-initzk-7hbfs               0/1     Completed   0          7m35s   10.244.87.17   liuxinle-optiplex-5060

Seeing c0-zookeeper-initzk-7hbfs in the Completed state and every other pod Running means the cluster has been set up successfully.
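
At this point you can also confirm that the cluster accepts SQL connections. A sketch, assuming the mysql client is installed locally and using the service host and port configured for this cluster:

mysql -h 172.16.70.249 -P 30901 -u root -p -e "SELECT 1;"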

Chaos Testing the GreatDB Distributed Cluster with Chaos Mesh

The fault types Chaos Mesh can inject in a kubernetes environment include simulated Pod faults, simulated network faults, simulated stress scenarios, and more. Here we take pod-kill, one of the simulated Pod faults, as an example.

Write the experiment configuration to a file named pod-kill.yaml, for example:


apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos                # type of fault to inject
metadata:
  name: pod-failure-example
  namespace: test             # namespace of the test cluster's pods
spec:
  action: pod-kill            # specific fault action to inject
  mode: all                   # how targets are selected; all = every pod matching the selector
  duration: '30s'             # duration of the experiment
  selector:
    labelSelectors:
      # target pod label, taken from the Labels field in the output of: kubectl describe pod c0-greatdb-1 -n test
      "app.kubernetes.io/component": "greatdb"

Create the fault experiment:


kubectl create -n test -f pod-kill.yaml

Once the experiment has been created, run kubectl get pod -n test -o wide; the result is as follows:


NAME                                    READY   STATUS              RESTARTS   AGE     IP             NODE                     NOMINATED NODE   READINESS GATES
c0-auth0-mysql-0                        2/2     Running             0          14m     10.244.87.18   liuxinle-optiplex-5060
c0-auth0-mysql-1                        2/2     Running             0          14m     10.244.87.54   liuxinle-optiplex-5060
c0-auth0-mysql-2                        2/2     Running             0          13m     10.244.87.57   liuxinle-optiplex-5060
c0-greatdb-0                            0/2     ContainerCreating   0          2s                     liuxinle-optiplex-5060
c0-greatdb-1                            0/2     ContainerCreating   0          2s                     liuxinle-optiplex-5060
c0-glob0-mysql-0                        2/2     Running             0          14m     10.244.87.51   liuxinle-optiplex-5060
c0-glob0-mysql-1                        2/2     Running             0          14m     10.244.87.41   liuxinle-optiplex-5060
c0-glob0-mysql-2                        2/2     Running             0          13m     10.244.87.60   liuxinle-optiplex-5060
c0-nor0-mysql-0                         2/2     Running             0          14m     10.244.87.29   liuxinle-optiplex-5060
c0-nor0-mysql-1                         2/2     Running             0          14m     10.244.87.4    liuxinle-optiplex-5060
c0-nor0-mysql-2                         2/2     Running             0          13m     10.244.87.25   liuxinle-optiplex-5060
c0-par0-mysql-0                         2/2     Running             0          14m     10.244.87.55   liuxinle-optiplex-5060
c0-par0-mysql-1                         2/2     Running             0          14m     10.244.87.13   liuxinle-optiplex-5060
c0-par0-mysql-2                         2/2     Running             0          13m     10.244.87.21   liuxinle-optiplex-5060
c0-prometheus-server-6697649b76-fkvh9   2/2     Running             0          9m24s   10.244.87.37   liuxinle-optiplex-5060
c0-zookeeper-0                          1/1     Running             1          12m     10.244.87.44   liuxinle-optiplex-5060
c0-zookeeper-1                          1/1     Running             0          11m     10.244.87.30   liuxinle-optiplex-5060
c0-zookeeper-2                          1/1     Running             0          10m     10.244.87.49   liuxinle-optiplex-5060
c0-zookeeper-initzk-7hbfs               0/1     Completed           0          12m     10.244.87.17   liuxinle-optiplex-5060

You can see that pods with greatdb in their names are being restarted, which shows the fault was injected successfully.
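
Chaos Mesh experiments are ordinary Kubernetes resources, so standard kubectl commands can inspect and remove them. A sketch using the names from pod-kill.yaml above:

kubectl describe podchaos pod-failure-example -n test

kubectl delete -n test -f pod-kill.yaml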

Orchestrating the Test Flow with Argo

Argo is an open-source container-native workflow engine for getting work done on Kubernetes; it models a multi-step workflow as a series of tasks, which is how we orchestrate the test flow.

We use argo to define a test task. The basic test flow is fixed, as shown below:

[Figure 2: the basic chaos test workflow]

Step 1 of the flow deploys the test cluster. Two parallel tasks then start: step 2 runs a test case to simulate the business workload while step 3 uses Chaos Mesh to inject faults. Once the step 2 test case finishes, step 4 stops the fault injection, and finally step 5 cleans up the cluster environment.
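
In an Argo Workflow, this sequential-plus-parallel shape is expressed with the steps syntax: each "- -" entry opens a new step group that runs after the previous group finishes, while an extra "-" entry inside the same group runs in parallel with it. A minimal sketch of the shape only (the template names are placeholders; step 4 is omitted because the template later in this article folds the stop-injection step into step 3's run-time and interval parameters):

  steps:
    - - name: step1-setup            # group 1: runs first
        template: setup
    - - name: step2-run-testcase     # group 2: these two run in parallel
        template: testcase
      - name: step3-inject-chaos
        template: chaos
    - - name: step5-cleanup          # group 3: runs after group 2 finishes
        template: cleanup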

Orchestrating a chaos test workflow with Argo (using the flashback test case as an example)

1) In cluster-setup.yaml, change the image information to the cluster deployment/cleanup image name and tag you pushed in "Preparing the Images Needed for Testing".

2) In testsuite-flashback.yaml, change the image information to the test case image name and tag you pushed in "Preparing the Images Needed for Testing".

3) Create resources from the cluster deployment, test case, and tool template yaml files, all with kubectl apply -n argo -f xxx.yaml (these files define argo templates for users to reference when writing workflows):


kubectl apply -n argo -f cluster-setup.yaml
kubectl apply -n argo -f testsuite-flashback.yaml
kubectl apply -n argo -f tools-template.yaml

4) Copy the workflow template file workflow-template.yaml, change the parts flagged by comments in the template to your own settings, and then create the chaos test workflow:


kubectl apply -n argo -f workflow-template.yaml
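
Argo Workflows are Kubernetes custom resources, so the run's progress can be followed with kubectl (or with the argo CLI, if it is installed); a sketch:

kubectl get workflows -n argo

kubectl get workflow chaostest-c0-0 -n argo -o wide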

Here is a sample workflow template file:


apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: chaostest-c0-0-
  name: chaostest-c0-0
  namespace: argo
spec:
  entrypoint: test-entry   # test entry point; pass the test parameters here: clustername, namespace, host, greatdb image name and tag, and other basics
  serviceAccountName: argo
  arguments:
    parameters:
      - name: clustername
        value: c0
      - name: namespace
        value: test
      - name: host
        value: 172.16.70.249
      - name: port
        value: 30901
      - name: password
        value: Bgview@2020
      - name: user
        value: root
      - name: run-time
        value: 10m
      - name: greatdb-repository
        value: greatdb/greatdb
      - name: greatdb-tag
        value: tag202110
      - name: nemesis
        value: kill_mysql_normal_master,kill_mysql_normal_slave,kill_mysql_partition_master,kill_mysql_partition_slave,kill_mysql_auth_master,kill_mysql_auth_slave,kill_mysql_global_master,kill_mysql_global_slave,kill_mysql_master,kill_mysql_slave,net_partition_mysql_normal,net_partition_mysql_partition,net_partition_mysql_auth,net_partition_mysql_global
      - name: mysql-partition
        value: 1
      - name: mysql-global
        value: 1
      - name: mysql-auth
        value: 1
      - name: mysql-normal
        value: 2
  templates:
    - name: test-entry
      steps:
        - - name: setup-greatdb-cluster   # step 1: deploy the cluster; set the parameters correctly, mainly the mysql and zookeeper image names and tags
            templateRef:
              name: cluster-setup-template
              template: cluster-setup
            arguments:
              parameters:
                - name: namespace
                  value: "{{workflow.parameters.namespace}}"
                - name: clustername
                  value: "{{workflow.parameters.clustername}}"
                - name: mysql-image
                  value: mysql:5.7.34
                - name: mysql-replica
                  value: 3
                - name: mysql-auth
                  value: "{{workflow.parameters.mysql-auth}}"
                - name: mysql-normal
                  value: "{{workflow.parameters.mysql-normal}}"
                - name: mysql-partition
                  value: "{{workflow.parameters.mysql-partition}}"
                - name: mysql-global
                  value: "{{workflow.parameters.mysql-global}}"
                - name: enable-monitor
                  value: false
                - name: zookeeper-repository
                  value: zookeeper
                - name: zookeeper-tag
                  value: 3.5.5
                - name: zookeeper-replica
                  value: 3
                - name: greatdb-repository
                  value: "{{workflow.parameters.greatdb-repository}}"
                - name: greatdb-tag
                  value: "{{workflow.parameters.greatdb-tag}}"
                - name: greatdb-replica
                  value: 3
                - name: greatdb-serviceHost
                  value: "{{workflow.parameters.host}}"
                - name: greatdb-servicePort
                  value: "{{workflow.parameters.port}}"
        - - name: run-flashbacktest   # step 2: run the test case; replace with the template of the case you want to run and set the parameters correctly, mainly the number and size of tables the test uses
            templateRef:
              name: flashback-test-template
              template: flashback
            arguments:
              parameters:
                - name: user
                  value: "{{workflow.parameters.user}}"
                - name: password
                  value: "{{workflow.parameters.password}}"
                - name: host
                  value: "{{workflow.parameters.host}}"
                - name: port
                  value: "{{workflow.parameters.port}}"
                - name: concurrency
                  value: 16
                - name: size
                  value: 10000
                - name: tables
                  value: 10
                - name: run-time
                  value: "{{workflow.parameters.run-time}}"
                - name: single-statement
                  value: true
                - name: manage-statement
                  value: true
          - name: invoke-chaos-for-flashback-test   # step 3: inject faults (in parallel with step 2); run-time and interval define the duration and frequency of injection, so the separate stop-injection step is omitted
            templateRef:
              name: chaos-rto-template
              template: chaos-rto
            arguments:
              parameters:
                - name: user
                  value: "{{workflow.parameters.user}}"
                - name: host
                  value: "{{workflow.parameters.host}}"
                - name: password
                  value: "{{workflow.parameters.password}}"
                - name: port
                  value: "{{workflow.parameters.port}}"
                - name: k8s-config
                  value: /root/.kube/config
                - name: namespace
                  value: "{{workflow.parameters.namespace}}"
                - name: clustername
                  value: "{{workflow.parameters.clustername}}"
                - name: prometheus
                  value: ''
                - name: greatdb-job
                  value: greatdb-monitor-greatdb
                - name: nemesis
                  value: "{{workflow.parameters.nemesis}}"
                - name: nemesis-duration
                  value: 1m
                - name: nemesis-mode
                  value: default
                - name: wait-time
                  value: 5m
                - name: check-time
                  value: 5m
                - name: nemesis-scope
                  value: 1
                - name: nemesis-log
                  value: true
                - name: enable-monitor
                  value: false
                - name: run-time
                  value: "{{workflow.parameters.run-time}}"
                - name: interval
                  value: 1m
                - name: monitor-log
                  value: false
                - name: enable-rto
                  value: false
                - name: rto-qps
                  value: 0.1
                - name: rto-warm
                  value: 5m
                - name: rto-time
                  value: 1m
                - name: log-level
                  value: debug
        - - name: flashbacktest-output   # report whether the test case passed
            templateRef:
              name: tools-template
              template: output-result
            arguments:
              parameters:
                - name: info
                  value: "flashback test pass, with nemesis: {{workflow.parameters.nemesis}}"
        - - name: clean-greatdb-cluster   # step 4: clean up the test cluster; these parameters match step 1's
            templateRef:
              name: cluster-setup-template
              template: cluster-setup
            arguments:
              parameters:
                - name: namespace
                  value: "{{workflow.parameters.namespace}}"
                - name: clustername
                  value: "{{workflow.parameters.clustername}}"
                - name: mysql-image
                  value: mysql:5.7
                - name: mysql-replica
                  value: 3
                - name: mysql-auth
                  value: "{{workflow.parameters.mysql-auth}}"
                - name: mysql-normal
                  value: "{{workflow.parameters.mysql-normal}}"
                - name: mysql-partition
                  value: "{{workflow.parameters.mysql-partition}}"
                - name: mysql-global
                  value: "{{workflow.parameters.mysql-global}}"
                - name: enable-monitor
                  value: false
                - name: zookeeper-repository
                  value: zookeeper
                - name: zookeeper-tag
                  value: 3.5.5
                - name: zookeeper-replica
                  value: 3
                - name: greatdb-repository
                  value: "{{workflow.parameters.greatdb-repository}}"
                - name: greatdb-tag
                  value: "{{workflow.parameters.greatdb-tag}}"
                - name: greatdb-replica
                  value: 3
                - name: greatdb-serviceHost
                  value: "{{workflow.parameters.host}}"
                - name: greatdb-servicePort
                  value: "{{workflow.parameters.port}}"
                - name: clean
                  value: true
        - - name: echo-result
            templateRef:
              name: tools-template
              template: echo
            arguments:
              parameters:
                - name: info
                  value: "{{item}}"
            withItems:
              - "{{steps.flashbacktest-output.outputs.parameters.result}}"

And that's it: you have successfully run a chaos test with Chaos Mesh and verified the stability of a distributed system.

Now enjoy GreatSQL, and enjoy Chaos Mesh :)
