乐趣区

关于mysql:MySQL金融应用场景下跨数据中心的MGR架构方案2

  • GreatSQL 社区原创内容未经受权不得随便应用,转载请分割小编并注明起源。

如何在多个数据中心部署多套 MGR 集群,并实现故障疾速切换。

上篇文章介绍了如何在多数据中心部署多套 MGR 集群,并构建集群间的复制通道。这样一旦主 AZ 不可用时,在校验完数据后,就能够切换到备用 AZ 的 MGR 集群,十分不便。

本文咱们持续深刻介绍如何利用 Async Replication Auto failover 实现故障主动转移的。

1、什么是 Async Replication Auto failover

从 MySQL 8.0.22 开始,推出一个新个性 ”Async Replication Auto failover”,当 MGR 集群产生故障时,其从库能够更不便的实现疾速主动切主。直译过去是“异步复制主动故障转移”,但实际上它也是反对半同步复制场景的。

You can use MySQL Server's new asynchronous connection failover mechanism to automatically establish an asynchronous (source to replica) replication connection to a new source after the existing connection from a replica to its source fails. The connection fails over if the replication I/O thread stops due to the source stopping or due to a network failure. The asynchronous connection failover mechanism can be used to keep a replica synchronized with multiple MySQL servers or groups of servers that share data. To activate asynchronous connection failover for a replication channel set SOURCE_CONNECTION_AUTO_FAILOVER=1 on the CHANGE MASTER TO statement for the channel, and set up a source list for the channel using the asynchronous_connection_failover_add_source and asynchronous_connection_failover_delete_source functions.

具体介绍见官网文档 17.4.9 Switching Sources with Asynchronous Connection Failover

2、基于 MGR 的两地三核心数据库架构计划

在两地三核心架构下,能够采纳上面这个部署计划

在这个架构计划里,MGR- B 能够采纳 异步复制 或 加强半同步复制 通道从 MGR- A 复制数据,这要取决于两个 AZ 之间的网络情况。

在金融利用场景下,这个网络条件个别能够失去保障,因而优先采纳增强版同步形式。

而跨城异地 AZ 里的 MGR C 则因为网络提早较大,大概率会采纳异步复制形式。

在上述计划中,不论是 MGR- B 还是 C,都面临一个问题:那就是复制源指向的主机实例,产生故障不可用之后,如何疾速切换,实现主动故障转移。

在以往,只能靠第三方工具实现切换。

在 MySQL 8.0.22 新增 ”Async Replication Auto failover” 个性后,就没这个懊恼了。

其工作机制是 在一个复制通道上设置多个复制源(source),它还反对对多个源设置不同权重。当发现主复制源产生故障异常中断后(会先尝试重连几次),即可实现主动切换到新的复制源。当原来的复制源复原后,如果其权重更高,还会再切换回去。

3、配置 Async Replication Auto failover

部署的过程很简略,几条命令就搞定了。

3.1、创立复制通道

依照惯例形式,在从实例上(本案以 MGR- B 为例)创立一个复制通道

[root@GreatSQL mgrB-1][(none)]> CHANGE REPLICATION SOURCE TO
MASTER_HOST='172.16.16.10', MASTER_PORT=3306, 
MASTER_USER='repl', MASTER_PASSWORD='repl',
MASTER_AUTO_POSITION=1,
SOURCE_CONNECTION_AUTO_FAILOVER=1, #这里是要害,示意开启主动故障转移
MASTER_RETRY_COUNT=3, #最多重试 3 次
MASTER_CONNECT_RETRY=10 #每次重试距离 10 秒
FOR CHANNEL 'MGR-A';

#简略解释下几个参数
- SOURCE_CONNECTION_AUTO_FAILOVER=1  #这里是要害,示意开启主动故障转移
- MASTER_RETRY_COUNT=3  #示意最多重试 3 次,默认是是 86400 次
- MASTER_CONNECT_RETRY=10 #示意每次重试距离 10 秒,默认是 60 秒 

确认增加的复制通道失效了:

[root@GreatSQL mgrB-1][(none)]> SELECT * FROM performance_schema.replication_applier_status\G
*************************** 1. row ***************************
              CHANNEL_NAME: mgr-a
             SERVICE_STATE: ON
           REMAINING_DELAY: NULL
COUNT_TRANSACTIONS_RETRIES: 0
*************************** 2. row ***************************
              CHANNEL_NAME: group_replication_applier
             SERVICE_STATE: ON
           REMAINING_DELAY: NULL
COUNT_TRANSACTIONS_RETRIES: 0

3.2、对复制通道增加多个复制源

接下来再对这个复制通道增加多个复制源(屡次调用该 UDF 即可):

[root@GreatSQL mgrB-1][(none)]> SELECT asynchronous_connection_failover_add_source('MGR-A','172.16.16.10',3306,null,60);
[root@GreatSQL mgrB-1][(none)]> SELECT asynchronous_connection_failover_add_source('MGR-A','172.16.16.11',3306,null,60);
[root@GreatSQL mgrB-1][(none)]> SELECT asynchronous_connection_failover_add_source('MGR-A','172.16.16.12',3306,null,60);

简略解释下几个参数

  • MGR-A #示意复制通道,和下面的复制通道同名
  • 172.16.16.10 #示意该复制源的 IP
  • 3306 #示意该复制源的端口
  • null #示意 network_namespace,将来的个性,当初先放空即可
  • 60 #示意该复制源的权重,下面咱们介绍了不同权重的作用,值越大越有机会抢到成为复制源

确认多个复制源失效:

[root@GreatSQL mgrB-1][(none)]> SELECT * FROM performance_schema.replication_asynchronous_connection_failover;
+--------------+--------------+------+-------------------+--------+--------------+
| CHANNEL_NAME | HOST         | PORT | NETWORK_NAMESPACE | WEIGHT | MANAGED_NAME |
+--------------+--------------+------+-------------------+--------+--------------+
| mgr-a        | 172.16.16.10 | 3306 |                   |     60 |              |
| mgr-a        | 172.16.16.11 | 3306 |                   |     60 |              |
| mgr-a        | 172.16.16.12 | 3306 |                   |     60 |              |
+--------------+--------------+------+-------------------+--------+--------------+

启动该复制通道:

[root@GreatSQL mgrB-1][(none)]> START REPLICA FOR CHANNEL 'MGR-A';

确认复制通道和 MGR 的状态都失常:

[root@GreatSQL mgrB-1][(none)]> SELECT * FROM performance_schema.replication_connection_status\G
*************************** 1. row ***************************
                                      CHANNEL_NAME: mgr-a
                                        GROUP_NAME:
                                       SOURCE_UUID: b084f8a1-96a8-11eb-9a70-525400fb993a
                                         THREAD_ID: 3084
                                     SERVICE_STATE: ON
                         COUNT_RECEIVED_HEARTBEATS: 5974
                          LAST_HEARTBEAT_TIMESTAMP: 2021-05-29 18:53:13.879720
                          RECEIVED_TRANSACTION_SET: 476c0276-be03-11eb-bd34-525400e802e2:21-31:1000016-1000017
                                 LAST_ERROR_NUMBER: 0
                                LAST_ERROR_MESSAGE:
                              LAST_ERROR_TIMESTAMP: 0000-00-00 00:00:00.000000
                           LAST_QUEUED_TRANSACTION: 476c0276-be03-11eb-bd34-525400e802e2:31
 LAST_QUEUED_TRANSACTION_ORIGINAL_COMMIT_TIMESTAMP: 0000-00-00 00:00:00.000000
LAST_QUEUED_TRANSACTION_IMMEDIATE_COMMIT_TIMESTAMP: 2021-05-27 17:19:43.201000
     LAST_QUEUED_TRANSACTION_START_QUEUE_TIMESTAMP: 2021-05-27 17:19:43.203315
       LAST_QUEUED_TRANSACTION_END_QUEUE_TIMESTAMP: 2021-05-27 17:19:43.203349
                              QUEUEING_TRANSACTION:
    QUEUEING_TRANSACTION_ORIGINAL_COMMIT_TIMESTAMP: 0000-00-00 00:00:00.000000
   QUEUEING_TRANSACTION_IMMEDIATE_COMMIT_TIMESTAMP: 0000-00-00 00:00:00.000000
        QUEUEING_TRANSACTION_START_QUEUE_TIMESTAMP: 0000-00-00 00:00:00.000000
*************************** 2. row ***************************
                                      CHANNEL_NAME: group_replication_applier
                                        GROUP_NAME: f195537d-19ac-11eb-b29f-5254002eb6d6
                                       SOURCE_UUID: f195537d-19ac-11eb-b29f-5254002eb6d6
                                         THREAD_ID: NULL
                                     SERVICE_STATE: ON
                         COUNT_RECEIVED_HEARTBEATS: 0
                          LAST_HEARTBEAT_TIMESTAMP: 0000-00-00 00:00:00.000000
                          RECEIVED_TRANSACTION_SET: 476c0276-be03-11eb-bd34-525400e802e2:1-31:1000015-1000017,
f195537d-19ac-11eb-b29f-5254002eb6d6:1-18
                                 LAST_ERROR_NUMBER: 0
                                LAST_ERROR_MESSAGE:
                              LAST_ERROR_TIMESTAMP: 0000-00-00 00:00:00.000000
                           LAST_QUEUED_TRANSACTION: f195537d-19ac-11eb-b29f-5254002eb6d6:18
 LAST_QUEUED_TRANSACTION_ORIGINAL_COMMIT_TIMESTAMP: 0000-00-00 00:00:00.000000
LAST_QUEUED_TRANSACTION_IMMEDIATE_COMMIT_TIMESTAMP: 0000-00-00 00:00:00.000000
     LAST_QUEUED_TRANSACTION_START_QUEUE_TIMESTAMP: 2021-05-27 17:04:03.407281
       LAST_QUEUED_TRANSACTION_END_QUEUE_TIMESTAMP: 2021-05-27 17:04:03.407317
                              QUEUEING_TRANSACTION:
    QUEUEING_TRANSACTION_ORIGINAL_COMMIT_TIMESTAMP: 0000-00-00 00:00:00.000000
   QUEUEING_TRANSACTION_IMMEDIATE_COMMIT_TIMESTAMP: 0000-00-00 00:00:00.000000
        QUEUEING_TRANSACTION_START_QUEUE_TIMESTAMP: 0000-00-00 00:00:00.000000

执行 SHOW REPLICA STATUS 查看状态:

*************************** 1. row ***************************
             Replica_IO_State: Waiting for master to send event
                  Source_Host: 172.16.16.10
                  Source_User: repl
                  Source_Port: 3306
                Connect_Retry: 10
...
           Replica_IO_Running: Yes
          Replica_SQL_Running: Yes
...
                  Source_UUID: 5499a6cb-91cb-11eb-966f-525400e802e2
...
           Source_Retry_Count: 3
...
           Retrieved_Gtid_Set: 476c0276-be03-11eb-bd34-525400e802e2:21-31:1000016-1000017
            Executed_Gtid_Set: 476c0276-be03-11eb-bd34-525400e802e2:1-31:1000015-1000017,
f195537d-19ac-11eb-b29f-5254002eb6d6:1-18
                Auto_Position: 1
         Replicate_Rewrite_DB:
                 Channel_Name: mgr-a
...               

先记住下面输入后果中的 Source_Host 和 Source_UUID 等信息,上面模仿一次复制源服务器宕机后,主动切换复制源的场景。

4、模仿故障,确认可主动切换

在以后复制源服务器上,执行 kill -9 杀掉 mysqld 过程,而后就能看到从服务器上有相似如下日志:

# 先尝试 3 次(每次距离 10 秒)重连旧的复制源服务器
[ERROR] [MY-010584] [Repl] Slave I/O for channel 'mgr-a': error connecting to master 'repl@172.16.16.10:3306' - retry-time: 10 retries: 3 message: Can't connect to MySQL server on'172.16.16.10:3306' (111), Error_code: MY-002003

#重试失败,进行复制 I / O 线程
[Note] [MY-010563] [Repl] Slave I/O thread for channel 'mgr-a' killed while connecting to master
[Warning] [MY-010897] [Repl] Storing MySQL user name or password information in the master info repository is not secure and is therefore not recommended. Please consider using the USER and PASSWORD connection options for START SLAVE; see the 'START SLAVE Syntax' in the MySQL Manual for more information.

# 再次启动复制 I / O 线程,连贯到新的复制源服务器
[System] [MY-010562] [Repl] Slave I/O thread for channel 'mgr-a': connected to master 'repl@172.16.16.11:3306',replication started in log 'FIRST' at position 8598
# 告知 UUID 产生切换了
[Warning] [MY-010549] [Repl] The master's UUID has changed, although this should not happen unless you have changed it manually. The old UUID was ec2fcbeb-976c-11eb-a652-525400e2078a.

再次执行 SHOW REPLICA STATUS 确认复制源切换了:

*************************** 1. row ***************************
             Replica_IO_State: Waiting for master to send event
                  Source_Host: 172.16.16.11
                  Source_User: repl
                  Source_Port: 3306
                Connect_Retry: 10
...
           Replica_IO_Running: Yes
          Replica_SQL_Running: Yes
...
                  Source_UUID: ec2fcbeb-976c-11eb-a652-525400e2078a
...
           Source_Retry_Count: 3
...
           Retrieved_Gtid_Set: 476c0276-be03-11eb-bd34-525400e802e2:21-32:1000016-1000017
            Executed_Gtid_Set: 476c0276-be03-11eb-bd34-525400e802e2:1-32:1000015-1000017,
f195537d-19ac-11eb-b29f-5254002eb6d6:1-18
                Auto_Position: 1
         Replicate_Rewrite_DB:
                 Channel_Name: mgr-a
...

因为 3 个复制源的权重设置为一样,所以当原来的复制源服务器宕机复原后,不会再切换回去。而如果旧的复制源服务器权重设置较高的话,当他复原后,会再次发生切换,切回原来的源:

# 没有任何尝试重连的行为,间接发动切换
[Note] [MY-011026] [Repl] Slave I/O thread killed while reading event for channel 'mgr-a'.
[Note] [MY-010570] [Repl] Slave I/O thread exiting for channel 'mgr-a', read up to log 'FIRST', position 8871
[Warning] [MY-010897] [Repl] Storing MySQL user name or password information in the master info repository is not secure and is therefore not recommended. Please consider using the USER and PASSWORD connection options for START SLAVE; see the 'START SLAVE Syntax' in the MySQL Manual for more information.
[System] [MY-010562] [Repl] Slave I/O thread for channel 'mgr-a': connected to master 'repl@172.16.16.10:3306',replication started in log 'FIRST' at position 8871
[Warning] [MY-010549] [Repl] The master's UUID has changed, although this should not happen unless you have changed it manually. The old UUID was 5499a6cb-91cb-11eb-966f-525400e802e2. -- 再次切回原来的主 

这就很不便的能够实现主动故障转移了。

当初,咱们利用 MGR + 加强半同步复制 + 主动故障转移 构建了一套金融级利用场景下的两地多核心数据库架构计划。举荐选用可靠性、稳定性更高的 GreatSQL,能够更释怀的应用 MGR(GreatSQL,打造更好的 MGR 生态)。

前面再持续介绍基于 MGR 的其余架构解决方案。

Enjoy GreatSQL :)

文章举荐:

GreatSQL MGR FAQ
https://mp.weixin.qq.com/s/J6…

万答 #12,MGR 整个集群挂掉后,如何能力主动选主,不必手动干涉
https://mp.weixin.qq.com/s/07…

『2021 数据技术嘉年华·ON LINE』:《MySQL 高可用架构演进及实际》
https://mp.weixin.qq.com/s/u7…

一条 sql 语句慢在哪之抓包剖析
https://mp.weixin.qq.com/s/AY…

万答 #15,都有哪些状况可能导致 MGR 服务无奈启动
https://mp.weixin.qq.com/s/in…

技术分享 | 为什么 MGR 一致性模式不举荐 AFTER
https://mp.weixin.qq.com/s/rN…

对于 GreatSQL

GreatSQL 是由万里数据库保护的 MySQL 分支,专一于晋升 MGR 可靠性及性能,反对 InnoDB 并行查问个性,是实用于金融级利用的 MySQL 分支版本。

Gitee:
https://gitee.com/GreatSQL/Gr…

GitHub:
https://github.com/GreatSQL/G…

Bilibili:
https://space.bilibili.com/13…

微信 &QQ 群:
可搜寻增加 GreatSQL 社区助手微信好友,发送验证信息“加群”退出 GreatSQL/MGR 交换微信群

QQ 群:533341697
微信小助手:wanlidbc

本文由博客一文多发平台 OpenWrite 公布!

退出移动版