Technical Share: Why MGR's AFTER Consistency Mode Is Not Recommended


  • Original content from the GreatSQL community. It must not be reused without authorization; to republish, please contact the editors and credit the source.
  • 1. Introduction
  • 2. Write Consistency under AFTER
  • 3. Read Consistency under AFTER
  • 4. The AFTER Execution Flow
  • 5. The BEFORE Execution Flow
  • 6. Some Thoughts
  • 7. References

1. Introduction

During one of our tests, we found that at the AFTER consistency level, a node failure leaves the cluster unable to commit transactions; moreover, once a transaction has entered the commit phase, the other nodes cannot even start read-only transactions. The whole cluster cannot serve requests until the failed node is expelled from the group.

The steps to reproduce this failure scenario are as follows:

1. Initialize a 3-node cluster. The cluster information is:

   mysql> select * from performance_schema.replication_group_members;
    +---------------------------+--------------------------------------+-------------+-------------+--------------+-------------+----------------+
    | CHANNEL_NAME              | MEMBER_ID                            | MEMBER_HOST | MEMBER_PORT | MEMBER_STATE | MEMBER_ROLE | MEMBER_VERSION |
    +---------------------------+--------------------------------------+-------------+-------------+--------------+-------------+----------------+
    | group_replication_applier | c223cde5-0719-11ec-8295-ec5c6826bca3 | 127.0.0.1   |       13000 | ONLINE       | PRIMARY     | 8.0.25         |
    | group_replication_applier | c22847e0-0719-11ec-b30b-ec5c6826bca3 | 127.0.0.1   |       13004 | ONLINE       | PRIMARY     | 8.0.25         |
    | group_replication_applier | c22c3752-0719-11ec-98ef-ec5c6826bca3 | 127.0.0.1   |       13002 | ONLINE       | PRIMARY     | 8.0.25         |
    +---------------------------+--------------------------------------+-------------+-------------+--------------+-------------+----------------+
    3 rows in set (0.00 sec)

    mysql> select @@group_replication_member_expel_timeout;
    +------------------------------------------+
    | @@group_replication_member_expel_timeout |
    +------------------------------------------+
    |                                     1000 |
    +------------------------------------------+
    1 row in set (0.00 sec)

2. At the AFTER level, create a table and insert a row

13000-conn1
    mysql> set session group_replication_consistency='AFTER';
    Query OK, 0 rows affected (0.00 sec)

    mysql> create table t1 (c1 int primary key, c2 int); 
    Query OK, 0 rows affected (0.12 sec)

    mysql> insert into t1 values (1,1); 
    Query OK, 1 row affected (0.03 sec)

3. Force-kill one node. Because group_replication_member_expel_timeout is set to 1000 seconds, the failed node goes into the UNREACHABLE state

    mysql> select * from performance_schema.replication_group_members;
    +---------------------------+--------------------------------------+-------------+-------------+--------------+-------------+----------------+
    | CHANNEL_NAME              | MEMBER_ID                            | MEMBER_HOST | MEMBER_PORT | MEMBER_STATE | MEMBER_ROLE | MEMBER_VERSION |
    +---------------------------+--------------------------------------+-------------+-------------+--------------+-------------+----------------+
    | group_replication_applier | c223cde5-0719-11ec-8295-ec5c6826bca3 | 127.0.0.1   |       13000 | ONLINE       | PRIMARY     | 8.0.25         |
    | group_replication_applier | c22847e0-0719-11ec-b30b-ec5c6826bca3 | 127.0.0.1   |       13004 | UNREACHABLE  | PRIMARY     | 8.0.25         |
    | group_replication_applier | c22c3752-0719-11ec-98ef-ec5c6826bca3 | 127.0.0.1   |       13002 | ONLINE       | PRIMARY     | 8.0.25         |
    +---------------------------+--------------------------------------+-------------+-------------+--------------+-------------+----------------+
    3 rows in set (0.00 sec)

4. Now insert another row. The statement does not return; it sits waiting in the commit phase

    13000-conn1
    mysql> insert into t1 values (2,2);

    13000-conn2
    mysql> select time, state, info from information_schema.processlist;
    +------+--------------------------------------------------------+--------------------------------------------------------------+
    | time | state                                                  | info                                                         |
    +------+--------------------------------------------------------+--------------------------------------------------------------+
    |    0 | executing                                              | select time, state, info from information_schema.processlist |
    |  193 | waiting for handler commit                             | Group replication applier module                             |
    |  228 | Waiting on empty queue                                 | NULL                                                         |
    |  193 | Slave has read all relay log; waiting for more updates | NULL                                                         |
    |   50 | waiting for handler commit                             | insert into t1 values (2,2)                                  |
    +------+--------------------------------------------------------+--------------------------------------------------------------+
    5 rows in set (0.01 sec)

5. Log in to the other surviving node; even queries cannot execute

13002-conn1
mysql> set @@group_replication_consistency='eventual';
Query OK, 0 rows affected (0.00 sec)

mysql> select * from t1;

13002-conn2
mysql> select time, state, info from information_schema.processlist;
+------+--------------------------------------+--------------------------------------------------------------+
| time | state                                | info                                                         |
+------+--------------------------------------+--------------------------------------------------------------+
|    0 | executing                            | select time, state, info from information_schema.processlist |
|  354 | waiting for handler commit           | Group replication applier module                             |
|  403 | Waiting on empty queue               | NULL                                                         |
|  225 | waiting for handler commit           | NULL                                                         |
|   13 | Executing hook on transaction begin. | select * from t1                                             |
+------+--------------------------------------+--------------------------------------------------------------+
5 rows in set (0.01 sec)

2. Write Consistency under AFTER

The first problem in the failure above: in an MGR cluster, once the transaction consistency level is set to AFTER, any single-point failure makes the cluster unavailable. With default settings, it generally takes 5 to 10 seconds from a node failing to its being expelled from the group, and during that window the database cannot commit any write transaction; and as in the test above, if expel_timeout is set larger, the cluster stays unusable for even longer. For an online business, this is a disaster.

Checking the original worklog for MGR read/write consistency shows that this behavior matches the design intent and is not a bug. The following requirement in the worklog explains the problem above:

FR-08: When group_replication_consistency=AFTER or BEFORE_AND_AFTER,
       if there are unreachable members but the group still has a
       reachable majority, the transaction will wait until that
       members are reachable or leave the group.

One benefit of this design is that it satisfies the strict data-consistency needs of some businesses, but it is extremely unfriendly to ordinary workloads. For typical workloads, therefore, the AFTER mechanism is not recommended. On the other hand, in non-AFTER modes the transaction message only reaches a majority in memory via the Paxos protocol, with no requirement that the data be flushed to disk, so if a majority of nodes fail at the same time, there is a risk of losing data.
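The contrast between the two acknowledgment rules can be sketched in a few lines of Python (a minimal illustration; the function names are invented here, not MGR internals):

```python
# Non-AFTER levels: a transaction message only needs a Paxos majority,
# agreed in memory. AFTER: an acknowledgment from every ONLINE member,
# including members that are UNREACHABLE but not yet expelled.

def majority_ack_count(group_size: int) -> int:
    """Acks needed for a Paxos majority (non-AFTER levels)."""
    return group_size // 2 + 1

def after_ack_count(group_size: int) -> int:
    """Acks needed under AFTER: all ONLINE members."""
    return group_size

# In a 3-node group, one crashed-but-not-expelled node still leaves a
# majority (2 acks) for non-AFTER levels, but AFTER waits for all 3.
print(majority_ack_count(3), after_ack_count(3))  # → 2 3
```

This is why a single node failure stalls AFTER commits until expulsion, while non-AFTER levels keep committing, at the cost of the data-loss risk described above.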

3. Read Consistency under AFTER

The second problem in the failure above: while node 1 is in the commit phase of a transaction, node 2 cannot even open a read-only transaction at the eventual level. Until the failed node is expelled from the group, node 2 cannot provide any read or write service.

Again, the worklog explains this problem:

FR-06: When group_replication_consistency=AFTER or BEFORE_AND_AFTER,
       since the remote ONLINE members do acknowledge the
       transaction on prepare, the new transactions on those members
       shall be held until the preceding prepared are committed.

In other words, once a transaction has entered the prepared stage on a remote node, new transactions there must wait until that transaction finishes committing, no matter what consistency level they use. The drawback is that under AFTER, not only does throughput drop on the executing node; the other nodes also lose performance even when acting as read-only nodes.

In a multi-primary deployment, the performance impact can be even larger. Another problem is that a malicious user could enable the AFTER level and run a large transaction, leaving the other nodes unable to start new transactions for a long time. The official worklog mentions this problem as well:

SECURITY CONTEXT
================
From a point of view of malicious attack to the group, since when
group_replication_consistency=AFTER or BEFORE_AND_AFTER a
transaction will wait for a acknowledge from all ONLINE members, a
UNREACHABLE member will block a transaction execution until that
member is reachable or leaves the group.

A malicious user can set group_replication_consistency=AFTER or
BEFORE_AND_AFTER on long lived transactions, which may block new
transactions while those long lived transactions are being applied.

4. The AFTER Execution Flow

First, the flow on the node executing the transaction is:

1. When the transaction enters the commit phase, a before_commit HOOK runs; in MGR the corresponding implementation is group_replication_trans_before_commit. The AFTER consistency guarantee is implemented through this interface.
2. Suppose transaction T1 executes on node M1. At the AFTER level, a Transaction_with_guarantee_message carrying the full transaction data is sent via Paxos, with message type CT_TRANSACTION_WITH_GUARANTEE_MESSAGE.
3. When a node receives and processes this message, it first obtains the cluster's current online_members. Note that even a node whose state has become UNREACHABLE is still counted among the online_members as long as it has not been expelled from the group.
4. Node M1 then waits for responses from the other nodes.
5. Only after M1 has received a prepared message from every node in the online_members set above can it proceed to complete the commit.
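The executing-node wait in steps 4 and 5 can be sketched as follows (a simplified illustration under assumed names such as online_members; this is not the actual MGR source):

```python
# The before_commit hook on M1 blocks until every member that was
# ONLINE when the Transaction_with_guarantee_message was sent has
# acknowledged the prepare, UNREACHABLE-but-not-expelled ones included.

def can_commit(online_members: set, prepared_acks: set) -> bool:
    """True only when all online_members have sent a prepared message."""
    return online_members <= prepared_acks

# 3-node group where M3 has been killed but not yet expelled:
members = {"M1", "M2", "M3"}
print(can_commit(members, {"M1", "M2"}))         # → False (commit blocks)
print(can_commit(members, {"M1", "M2", "M3"}))   # → True
```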

Next, the flow on the other nodes (taking node M2 as an example) when handling an AFTER transaction:

1. Paxos delivers the transaction, which enters the execution stage.
2. When T1 enters the commit phase on M2, the before-commit hook handles it; unlike the user thread on M1, the replication thread on M2 runs on the GR_APPLIER_CHANNEL.
3. The transaction is added to the prepared-transactions list.
4. A transaction_prepared message is sent to all nodes, and M2 waits for it to be processed.
5. After the transaction_prepared message has been acknowledged by the other nodes, the transaction is removed from the prepared list and the commit continues.

In AFTER mode, every node handling the transaction must send a transaction_prepared message and wait for acknowledgments from all nodes before the transaction on the user thread can commit successfully. Besides the time the user thread spends waiting for all nodes to commit, the network overhead of processing these messages also takes a toll on performance.
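Putting the remote-node steps together, the following toy class (a hypothetical structure with invented names, not MGR code) shows both the prepared-list bookkeeping and why new transactions are held per FR-06:

```python
# Each prepared transaction waits for transaction_prepared acks from
# every group member; while any transaction sits in the prepared list,
# the hook run at the start of a new transaction must wait.

class RemoteNode:
    def __init__(self, group: set):
        self.group = group
        self.prepared = {}  # txn id -> set of acks received

    def on_prepare(self, txn: str) -> None:
        # steps 3-4: add to the prepared list, broadcast transaction_prepared
        self.prepared[txn] = set()

    def on_prepared_ack(self, txn: str, member: str) -> bool:
        # step 5: once every member has acked, remove the txn so it can commit
        self.prepared[txn].add(member)
        if self.prepared[txn] == self.group:
            del self.prepared[txn]
            return True
        return False

    def can_begin_new_txn(self) -> bool:
        # FR-06: held while any preceding prepared txn awaits commit
        return not self.prepared

node = RemoteNode({"M1", "M2", "M3"})
node.on_prepare("T1")
print(node.can_begin_new_txn())          # → False: T1 still prepared
node.on_prepared_ack("T1", "M1")
node.on_prepared_ack("T1", "M2")
print(node.on_prepared_ack("T1", "M3"))  # → True: T1 may commit
print(node.can_begin_new_txn())          # → True
```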

Another point worth noting: if transaction T1 has not yet entered the prepared stage on node M2, starting a new transaction there is not blocked. On a DEBUG build, this can be verified with the following steps.

connect 13000:
mysql> select * from performance_schema.replication_group_members;
+---------------------------+--------------------------------------+-------------+-------------+--------------+-------------+----------------+
| CHANNEL_NAME              | MEMBER_ID                            | MEMBER_HOST | MEMBER_PORT | MEMBER_STATE | MEMBER_ROLE | MEMBER_VERSION |
+---------------------------+--------------------------------------+-------------+-------------+--------------+-------------+----------------+
| group_replication_applier | 1ca5023b-0a1d-11ec-82f9-c8f7507e5048 | 127.0.0.1   |       13000 | ONLINE       | PRIMARY     | 8.0.25         |
| group_replication_applier | 1cab1e2b-0a1d-11ec-9eb9-c8f7507e5048 | 127.0.0.1   |       13004 | ONLINE       | PRIMARY     | 8.0.25         |
| group_replication_applier | 1caf096c-0a1d-11ec-a241-c8f7507e5048 | 127.0.0.1   |       13002 | ONLINE       | PRIMARY     | 8.0.25         |
| group_replication_applier | 1cbc3cf7-0a1d-11ec-955d-c8f7507e5048 | 127.0.0.1   |       13006 | ONLINE       | PRIMARY     | 8.0.25         |
| group_replication_applier | 1cc6f5eb-0a1d-11ec-8e81-c8f7507e5048 | 127.0.0.1   |       13008 | ONLINE       | PRIMARY     | 8.0.25         |
+---------------------------+--------------------------------------+-------------+-------------+--------------+-------------+----------------+
5 rows in set (0.03 sec)

mysql> set session group_replication_consistency='AFTER';
Query OK, 0 rows affected (0.00 sec)

mysql> create table t1 (c1 int primary key, c2 int); 
 Query OK, 0 rows affected (0.17 sec)

mysql> insert into t1 values (1,1); 
Query OK, 1 row affected (0.07 sec)

kill -9 13008
mysql> select * from performance_schema.replication_group_members;
+---------------------------+--------------------------------------+-------------+-------------+--------------+-------------+----------------+
| CHANNEL_NAME              | MEMBER_ID                            | MEMBER_HOST | MEMBER_PORT | MEMBER_STATE | MEMBER_ROLE | MEMBER_VERSION |
+---------------------------+--------------------------------------+-------------+-------------+--------------+-------------+----------------+
| group_replication_applier | 1ca5023b-0a1d-11ec-82f9-c8f7507e5048 | 127.0.0.1   |       13000 | ONLINE       | PRIMARY     | 8.0.25         |
| group_replication_applier | 1cab1e2b-0a1d-11ec-9eb9-c8f7507e5048 | 127.0.0.1   |       13004 | ONLINE       | PRIMARY     | 8.0.25         |
| group_replication_applier | 1caf096c-0a1d-11ec-a241-c8f7507e5048 | 127.0.0.1   |       13002 | ONLINE       | PRIMARY     | 8.0.25         |
| group_replication_applier | 1cbc3cf7-0a1d-11ec-955d-c8f7507e5048 | 127.0.0.1   |       13006 | ONLINE       | PRIMARY     | 8.0.25         |
| group_replication_applier | 1cc6f5eb-0a1d-11ec-8e81-c8f7507e5048 | 127.0.0.1   |       13008 | UNREACHABLE  | PRIMARY     | 8.0.25         |
+---------------------------+--------------------------------------+-------------+-------------+--------------+-------------+----------------+
5 rows in set (0.00 sec)

connect 13002; use DEBUG_SYNC to keep the node from entering the prepared stage:
mysql> set global debug='+d,group_replication_before_commit_hook_wait';
Query OK, 0 rows affected (0.00 sec)

connect 13000; insert a new row, which does not return
mysql> insert into t1 values (2,2);

connect 13002; the transaction can be seen blocked by DEBUG_SYNC:
mysql> select command, time, state, info from information_schema.processlist;
+---------+------+----------------------------+-----------------------------------------------------------------------+
| command | time | state                      | info                                                                  |
+---------+------+----------------------------+-----------------------------------------------------------------------+
| Connect |  189 | waiting for handler commit | Group replication applier module                                      |
| Query   |    0 | executing                  | select command, time, state, info from information_schema.processlist |
| Sleep   |    7 |                            | NULL                                                                  |
| Daemon  |  240 | Waiting on empty queue     | NULL                                                                  |
| Query   |   64 | debug sync point: now      | NULL                                                                  |
+---------+------+----------------------------+-----------------------------------------------------------------------+
5 rows in set (0.01 sec)

At this point, the data can still be queried:
mysql> set session group_replication_consistency='eventual';
Query OK, 0 rows affected (0.00 sec)

mysql> select * from t1;
+----+------+
| c1 | c2   |
+----+------+
|  1 |    1 |
+----+------+
1 row in set (0.00 sec)

5. The BEFORE Execution Flow

BEFORE is the counterpart of AFTER. If a session enables the BEFORE level, then when a transaction starts it must wait until all transactions already committed in the group have also finished committing locally; this is implemented via WAIT_FOR_GTID_SET. At the moment the transaction starts, the node obtains the set of GTIDs of all transactions it has received so far, and waits for every GTID in that set to finish committing, which guarantees the transaction reads the latest data when it executes.

Before obtaining the GTID set, however, the node must send a message of type SYNC_BEFORE_EXECUTION via Paxos. Because Paxos queues and orders messages, by the time the SYNC_BEFORE_EXECUTION message has been processed, all transaction messages sent before it are guaranteed to have finished their processing in Paxos. And since this message is generated when the current transaction starts, the GTID set the node has received at that point satisfies the BEFORE level.

If the node did not send the SYNC_BEFORE_EXECUTION message, the BEFORE level might fail to read the latest data. Suppose there is a network partition among three nodes A, B, and C, after which A and B form the majority and C is in the minority; new write transactions on A and B will no longer be replicated to C. Before C is expelled from the group, if C enabled the BEFORE level without sending the SYNC_BEFORE_EXECUTION message, it could not read the new data, violating the design intent of BEFORE. Once the message is sent, since consensus on it cannot be reached, new transactions will either fail or keep waiting for the message to return, rather than handing the user stale data.
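The role of the SYNC_BEFORE_EXECUTION marker can be sketched like this (a toy model: a deque stands in for ordered Paxos delivery, and the function name is invented for illustration):

```python
from collections import deque

def before_begin(paxos_queue: deque, local_gtids: set, committed: set) -> set:
    """Push our sync marker through the ordered channel; everything
    delivered ahead of it was ordered before our transaction started.
    Return the GTIDs the session must still wait on (the set handed
    to WAIT_FOR_GTID_SET)."""
    paxos_queue.append("SYNC_BEFORE_EXECUTION")
    while True:
        msg = paxos_queue.popleft()
        if msg == "SYNC_BEFORE_EXECUTION":
            break                       # all earlier messages processed
        local_gtids.add(msg)            # a transaction ordered before us
    return local_gtids - committed

q = deque(["gtid:1", "gtid:2"])         # writes ordered before our BEGIN
pending = before_begin(q, {"gtid:0"}, {"gtid:0", "gtid:1"})
print(sorted(pending))                  # → ['gtid:2']
```

In a minority partition the marker never wins consensus, so it never comes back through the channel: the session waits or fails instead of reading stale data, matching the example below.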

The following example shows how the BEFORE level behaves when a majority of the nodes fail.

Initial state:
mysql> select * from performance_schema.replication_group_members;
+---------------------------+--------------------------------------+-------------+-------------+--------------+-------------+----------------+
| CHANNEL_NAME              | MEMBER_ID                            | MEMBER_HOST | MEMBER_PORT | MEMBER_STATE | MEMBER_ROLE | MEMBER_VERSION |
+---------------------------+--------------------------------------+-------------+-------------+--------------+-------------+----------------+
| group_replication_applier | dd6398ec-09fe-11ec-95de-c8f7507e5048 | 127.0.0.1   |       13002 | ONLINE       | PRIMARY     | 8.0.25         |
| group_replication_applier | dd64424b-09fe-11ec-aeeb-c8f7507e5048 | 127.0.0.1   |       13000 | ONLINE       | PRIMARY     | 8.0.25         |
| group_replication_applier | dd65b9de-09fe-11ec-9d06-c8f7507e5048 | 127.0.0.1   |       13004 | ONLINE       | PRIMARY     | 8.0.25         |
+---------------------------+--------------------------------------+-------------+-------------+--------------+-------------+----------------+
3 rows in set (0.06 sec)

mysql> select @@group_replication_member_expel_timeout;
+------------------------------------------+
| @@group_replication_member_expel_timeout |
+------------------------------------------+
|                                       60 |
+------------------------------------------+
1 row in set (0.00 sec)

Query the data:
mysql> set session group_replication_consistency='BEFORE';
Query OK, 0 rows affected (0.00 sec)

mysql> select * from t1;
+----+------+
| c1 | c2   |
+----+------+
|  2 |    2 |
+----+------+
1 row in set (0.01 sec)

Use kill -9 to simulate a majority-node failure:
mysql> select * from performance_schema.replication_group_members;
+---------------------------+--------------------------------------+-------------+-------------+--------------+-------------+----------------+
| CHANNEL_NAME              | MEMBER_ID                            | MEMBER_HOST | MEMBER_PORT | MEMBER_STATE | MEMBER_ROLE | MEMBER_VERSION |
+---------------------------+--------------------------------------+-------------+-------------+--------------+-------------+----------------+
| group_replication_applier | dd6398ec-09fe-11ec-95de-c8f7507e5048 | 127.0.0.1   |       13002 | UNREACHABLE  | PRIMARY     | 8.0.25         |
| group_replication_applier | dd64424b-09fe-11ec-aeeb-c8f7507e5048 | 127.0.0.1   |       13000 | UNREACHABLE  | PRIMARY     | 8.0.25         |
| group_replication_applier | dd65b9de-09fe-11ec-9d06-c8f7507e5048 | 127.0.0.1   |       13004 | ONLINE       | PRIMARY     | 8.0.25         |
+---------------------------+--------------------------------------+-------------+-------------+--------------+-------------+----------------+
3 rows in set (0.00 sec)

Now query the data again; the statement never returns
mysql> select * from t1;

From another client, check the state; the query stays in the before-hook stage:
mysql> select command, time, state, info from information_schema.processlist;
+---------+------+--------------------------------------------------------+-----------------------------------------------------------------------+
| command | time | state                                                  | info                                                                  |
+---------+------+--------------------------------------------------------+-----------------------------------------------------------------------+
| Connect |  253 | waiting for handler commit                             | Group replication applier module                                      |
| Daemon  |  318 | Waiting on empty queue                                 | NULL                                                                  |
| Query   |  238 | Slave has read all relay log; waiting for more updates | NULL                                                                  |
| Query   |  170 | Executing hook on transaction begin.                   | select * from t1                                                      |
| Query   |    0 | executing                                              | select command, time, state, info from information_schema.processlist |
+---------+------+--------------------------------------------------------+-----------------------------------------------------------------------+
5 rows in set (0.00 sec)

Of course, because the BEFORE level sends an extra SYNC_BEFORE_EXECUTION message, adding one network round trip, it also has some impact on performance.

6. Some Thoughts

In AFTER mode, when a transaction on a remote node has entered the prepared stage but the user thread has not yet finished the commit, is it really necessary to make new transactions wait for that commit before they can start? Under the current design, this requirement causes a certain degree of node unavailability.

AFTER mode waits for every node to commit successfully. It is a strongly consistent data-synchronization scheme, but it can also make the cluster unavailable. And because the parameter group_replication_consistency is a session-level variable, even a single user connection enabling AFTER mode can render the whole cluster unavailable. One alternative is the majority-after mode of the open-source GreatSQL branch (group_replication_majority_after_mode = ON), which avoids the problems above.

7. References

MySQL design document (https://dev.mysql.com/worklog…)

MySQL user documentation (https://dev.mysql.com/doc/ref…)

Enjoy GreatSQL :)


About GreatSQL

GreatSQL is a MySQL branch maintained by Wanli Database, focused on improving MGR reliability and performance and supporting the InnoDB parallel query feature; it is a MySQL branch suited to financial-grade applications.

Gitee:
https://gitee.com/GreatSQL/Gr…

GitHub:
https://github.com/GreatSQL/G…
