共计 8967 个字符,预计需要花费 23 分钟才能阅读完成。
一、背景
分享一个在我的项目运维中遇到的一个主从复制限度的一个坑,我的项目的架构为主集群 + 灾备集群,每个集群为一主两从模式。主集群到灾备集群的同步为主从复制的形式,依据业务需要灾备集群须要疏忽零碎库跟某些配置表,所以才会触发此限度,而这个限度如果咱们之前没有遇到过,那么排查起来也是绝对不易的。
二、限度形容
1、主从同步呈现报错
greatsql> show slave status\G | |
*************************** 1. row *************************** | |
Slave_IO_State: Waiting for master to send event | |
Master_Host: 192.168.xxx.xxx | |
Master_User: repl | |
Master_Port: 3306 | |
Connect_Retry: 60 | |
Master_Log_File: greatsql-bin.000990 | |
Read_Master_Log_Pos: 92274290 | |
Relay_Log_File: greatsql-relay.002963 ----- | |
Relay_Log_Pos: 701548899 | |
Relay_Master_Log_File: greatsql-bin.000988 | |
Slave_IO_Running: Yes | |
Slave_SQL_Running: No | |
Replicate_Do_DB: | |
Replicate_Ignore_DB: mysql,dbscale,dbscale_tmp,information_schema,performance_schema,sys | |
Replicate_Do_Table: | |
Replicate_Ignore_Table: | |
Replicate_Wild_Do_Table: | |
Replicate_Wild_Ignore_Table: A.ab,B.bc | |
Last_Errno: 1146 | |
Last_Error: Coordinator stopped because there were error(s) in the worker(s). The most recent failure being: Worker 1 failed executing transaction '9e668a93-2618-11ee-93ee-bc16954181bb:47508257' at master log greatdb-bin.000988, end_log_pos 701570116. See error log and/or performance_schema.replication_applier_status_by_worker table for more details about this failure or others, if any. | |
Skip_Counter: 0 | |
Exec_Master_Log_Pos: 701548690 | |
Relay_Log_Space: 2246320360 | |
Until_Condition: None | |
Until_Log_File: | |
Until_Log_Pos: 0 | |
Master_SSL_Allowed: No | |
Master_SSL_CA_File: | |
Master_SSL_CA_Path: | |
Master_SSL_Cert: | |
Master_SSL_Cipher: | |
Master_SSL_Key: | |
Seconds_Behind_Master: NULL | |
Master_SSL_Verify_Server_Cert: No | |
Last_IO_Errno: 0 | |
Last_IO_Error: | |
Last_SQL_Errno: 1146 | |
Last_SQL_Error: Coordinator stopped because there were error(s) in the worker(s). The most recent failure being: Worker 1 failed executing transaction '9e668a93-2618-11ee-93ee-bc16954181bb:47508257' at master log greatdb-bin.000988, end_log_pos 701570116. See error log and/or performance_schema.replication_applier_status_by_worker table for more details about this failure or others, if any. | |
Replicate_Ignore_Server_Ids: | |
Master_Server_Id: 1943306 | |
Master_UUID: 9e668a93-2618-11ee-93ee-bc16954181bb | |
Master_Info_File: mysql.slave_master_info | |
SQL_Delay: 0 | |
SQL_Remaining_Delay: NULL | |
Slave_SQL_Running_State: | |
Master_Retry_Count: 86400 | |
Master_Bind: | |
Last_IO_Error_Timestamp: | |
Last_SQL_Error_Timestamp: 230822 14:14:18 | |
Master_SSL_Crl: | |
Master_SSL_Crlpath: | |
Retrieved_Gtid_Set: 9e668a93-2618-11ee-93ee-bc16954181bb:2-47565802 | |
Executed_Gtid_Set: 30873cfe-8750-11ed-b56f-744aa4073024:1-270, | |
9e668a93-2618-11ee-93ee-bc16954181bb:1-47508256 | |
Auto_Position: 1 | |
Replicate_Rewrite_DB: | |
Channel_Name: | |
Master_TLS_Version: | |
1 row in set (0.00 sec) |
依据 slave status 状态信息能够看出
- 报错的 GTID 为:
'9e668a93-2618-11ee-93ee-bc16954181bb:47508257'
- 利用的主集群的 binlog 为:
greatsql-bin.000988
- 灾备集群的 relay log 为:
greatsql-relay.002963
详细信息查看 performance_schema.replication_applier_status_by_worker
表
2、查看谬误的详细信息
greatsql> select * from performance_schema.replication_applier_status_by_worker\G | |
*************************** 1. row *************************** | |
CHANNEL_NAME: | |
WORKER_ID: 1 | |
THREAD_ID: NULL | |
SERVICE_STATE: OFF | |
LAST_SEEN_TRANSACTION: 9e668a93-2618-11ee-93ee-bc16954181bb:47508257 | |
LAST_ERROR_NUMBER: 1146 | |
LAST_ERROR_MESSAGE: Worker 1 failed executing transaction | |
'9e668a93-2618-11ee-93ee-bc16954181bb:47508257' at master log greatdb-bin.000988, | |
end_log_pos 701570116; Error executing row event: 'Table'abs_xxx.tmp_xxx_info'doesn't exist' | |
LAST_ERROR_TIMESTAMP: 2023-08-22 14:14:18 |
上述信息阐明依据 performance_schema.replication_applier_status_by_worker
表中的具体错误信息能够发现为灾备集群 abs_xxx.tmp_xxx_info
表不存在,导致同步报错
3、问题剖析
3.1、确认灾备集群中指标表是否存在
greatsql> show create table abs_xxx.tmp_xxx_info; | |
ERROR 1146 (42S02): Table 'abs_xxx.tmp_xxx_info' doesn't exist | |
greatsql> desc abs_xxx.tmp_xxx_info; | |
ERROR 1146 (42S02): Table 'abs_xxx.tmp_xxx_info' doesn't exist |
论断:灾备集群中指标表确实不存在
3.2、依据主从报错信息解析主集群 binlog,报错的 SQL
解析主集群 binlog
SET @@SESSION.GTID_NEXT= '9e668a93-2618-11ee-93ee-bc16954181bb:47508257'/*!*/; | |
…… | |
#230822 14:14:18 server id 1943306 end_log_pos 701570000 Table_map: `abs_xxx`.`tmp_xxx_info` mapped to number 1595 | |
# at 701570000 | |
#230822 14:14:18 server id 1943306 end_log_pos 701570116 Write_rows: table id 1595 flags: STMT_END_F | |
### INSERT INTO `abs_xxx`.`tmp_xxx_info` | |
### SET | |
### @1=2 | |
### @2='自动化' | |
### @3='2300121212120000' | |
### @4='90000000' | |
### @5='1' | |
### @6='202001290231001' | |
### @7='2021-01-31 00:00:00' | |
# at 701570116 | |
#230822 14:14:18 server id 1943306 end_log_pos 701570143 Xid = 800998400 | |
COMMIT/*!*/; | |
# at 701570143 | |
#230822 14:14:18 server id 1943306 end_log_pos 701570204 GTID last_committed=26491 sequence_number=26521 rbr_only=yes | |
/*!50718 SET TRANSACTION ISOLATION LEVEL READ COMMITTED*//*!*/; |
论断:依据复制的报错信息得悉具体的 GTID 号以及主集群的 binlog 文件,解析 binlog 得悉此事务为一条 INSERT 语句,语句中的指标表与 performance_schema.replication_applier_status_by_worker
表中信息统一
3.3、寻找主集群指标表 binlog 中是否有建表语句
在同一 binlog 日志中寻找建表语句
SET TIMESTAMP=1692684495/*!*/; | |
CREATE DATABASE IF NOT EXISTS `abs_xxx` /*!40100 DEFAULT CHARACTER SET utf8 COLLATE utf8_bin */ | |
/*!*/; | |
…… | |
use `information_schema`/*!*/; | |
SET TIMESTAMP=1692684495/*!*/; | |
CREATE TABLE `abs_xxx`.`tmp_xxx_info` (`ID` int(64) NOT NULL AUTO_INCREMENT, | |
`CUSTOMER_NAME` varchar(256) COLLATE utf8mb4_bin DEFAULT NULL, | |
`CUSTOMER_NUM` varchar(256) COLLATE utf8mb4_bin DEFAULT NULL, | |
`PRODIST_SKU_NUM` varchar(64) COLLATE utf8mb4_bin DEFAULT NULL, | |
`STATUS` varchar(64) COLLATE utf8mb4_bin DEFAULT NULL, | |
`AGREEMENT_NUM` varchar(256) COLLATE utf8mb4_bin DEFAULT NULL, | |
`END_DATE` datetime DEFAULT NULL, | |
PRIMARY KEY (`ID`), | |
KEY `PK_PRODIST_SKU_NUM_AGREEMENT` (`PRODIST_SKU_NUM`) USING BTREE, | |
KEY `IDX_AGREEMENT_NUM` (`AGREEMENT_NUM`) USING BTREE | |
) ENGINE=InnoDB AUTO_INCREMENT=2 DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_bin ROW_FORMAT=DYNAMIC | |
/*!*/; | |
# at 475864451 |
论断:在主集群的 binlog 日志中找到了指标表的建表语句,阐明主集群执行 DDL 时并没有敞开 binlog 日志,那么持续查看在灾备集群的中继日志中是否存在 DDL 语句
3.4、解析灾备集群的中继日志,确认是否拉取到灾备集群
#230822 14:08:15 server id 1943306 end_log_pos 475863662 GTID last_committed=16341 sequence_number=16342 rbr_only=no | |
SET @@SESSION.GTID_NEXT= '9e668a93-2618-11ee-93ee-bc16954181bb:47498079'/*!*/; | |
…… | |
use `information_schema`/*!*/; | |
SET TIMESTAMP=1692684495/*!*/; | |
CREATE TABLE `abs_xxx`.`tmp_xxx_info` (`ID` int(64) NOT NULL AUTO_INCREMENT, | |
`CUSTOMER_NAME` varchar(256) COLLATE utf8mb4_bin DEFAULT NULL, | |
`CUSTOMER_NUM` varchar(256) COLLATE utf8mb4_bin DEFAULT NULL, | |
`PRODIST_SKU_NUM` varchar(64) COLLATE utf8mb4_bin DEFAULT NULL, | |
`STATUS` varchar(64) COLLATE utf8mb4_bin DEFAULT NULL, | |
`AGREEMENT_NUM` varchar(256) COLLATE utf8mb4_bin DEFAULT NULL, | |
`END_DATE` datetime DEFAULT NULL, | |
PRIMARY KEY (`ID`), | |
KEY `PK_PRODIST_SKU_NUM_AGREEMENT` (`PRODIST_SKU_NUM`) USING BTREE, | |
KEY `IDX_AGREEMENT_NUM` (`AGREEMENT_NUM`) USING BTREE | |
) ENGINE=InnoDB AUTO_INCREMENT=2 DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_bin ROW_FORMAT=DYNAMIC | |
/*!*/; | |
# at 475864660 | |
#230822 14:08:15 server id 1943306 end_log_pos 475864512 GTID last_committed=16342 sequence_number=16343 rbr_only=yes | |
/*!50718 SET TRANSACTION ISOLATION LEVEL READ COMMITTED*//*!*/; |
论断:灾备集群的中继日志中存在 DDL 建表语句,阐明并不是 IO 线程出了问题
3.5、排查复制配置的疏忽库表
Replicate_Ignore_DB: mysql,dbscale,dbscale_tmp,information_schema,performance_schema,sys | |
Replicate_Wild_Ignore_Table: A.ab,B.bc |
论断:疏忽库表中并不蕴含指标表,然而依据以上解析日志发现,在主集群 binlog 日志中建表语句之前有个 use information_schema/!/;
的语句,此库为同步疏忽的零碎库,因而触发了 GreatSQL 的标准限度,在疏忽库下对未疏忽进行操作 Statement 模式下记录语句默认不起作用(详情:https://dev.mysql.com/doc/refman/5.7/en/replication-options-r…)
4、解决同步报错
在灾备集群创立指标表
greatsql> CREATE TABLE `abs_xxx`.`tmp_xxx_info` (`ID` int(64) NOT NULL AUTO_INCREMENT, | |
`CUSTOMER_NAME` varchar(256) COLLATE utf8mb4_bin DEFAULT NULL, | |
`CUSTOMER_NUM` varchar(256) COLLATE utf8mb4_bin DEFAULT NULL, | |
`PRODIST_SKU_NUM` varchar(64) COLLATE utf8mb4_bin DEFAULT NULL, | |
`STATUS` varchar(64) COLLATE utf8mb4_bin DEFAULT NULL, | |
`AGREEMENT_NUM` varchar(256) COLLATE utf8mb4_bin DEFAULT NULL, | |
`END_DATE` datetime DEFAULT NULL, | |
PRIMARY KEY (`ID`), | |
KEY `PK_PRODIST_SKU_NUM_AGREEMENT` (`PRODIST_SKU_NUM`) USING BTREE, | |
KEY `IDX_AGREEMENT_NUM` (`AGREEMENT_NUM`) USING BTREE | |
) ENGINE=InnoDB AUTO_INCREMENT=2 DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_bin ROW_FORMAT=DYNAMIC; | |
greatsql> stop slave; | |
greatsql> start slave; |
论断:在灾备集群创立指标表后重启复制复原胜利
三、限度躲避
1、第一种躲避形式
执行 DDL 时进入指标库
greatsql> use abs_cust | |
greatsql> DDL 语句(CREATE\DROP\ALTER) |
阐明:在利用连贯数据库时有可能默认就是 information_schema
库,而此环境将零碎库全副疏忽,所以为了躲避相似的问题,请在执行 SQL 语句时请先 use 到指标表的指标库。
2、第二种躲避形式
批改主从复制配置,以下步骤为测试环境
敞开灾备集群在复制同步
greatsql> stop slave; | |
Query OK, 0 rows affected, 1 warning (0.03 sec) |
批改疏忽库
greatsql> change replication filter Replicate_Ignore_DB=();
批改疏忽表
greatsql> change replication filter replicate_wild_ignore_table =('mysql.%','information_schema.%','sys.%','performance_schema.%');
启动同步
greatsql> start slave; | |
Query OK, 0 rows affected, 1 warning (0.37 sec) |
测试验证
主集群:
greatsql> use mysql | |
Reading table information for completion of table and column names | |
You can turn off this feature to get a quicker startup with -A | |
Database changed | |
greatsql> create table test111.test111(id int primary key); | |
Query OK, 0 rows affected (0.06 sec) | |
greatsql> show tables; | |
+-------------------+ | |
| Tables_in_test111 | | |
+-------------------+ | |
| test111 | | |
+-------------------+ | |
1 row in set (0.00 sec) |
灾备集群:
greatsql> use test111 | |
Reading table information for completion of table and column names | |
You can turn off this feature to get a quicker startup with -A | |
Database changed | |
greatsql> show tables; | |
+-------------------+ | |
| Tables_in_test111 | | |
+-------------------+ | |
| test111 | | |
+-------------------+ | |
1 row in set (0.00 sec) |
阐明:复制配置中参数 Replicate_Ignore_DB
设置为空,将 replicate_wild_ignore_table
参数设置为 shema_name.%
的形式也能够躲避相似的问题
四、特地阐明
- 在 MySQL 5.7 跟 8.0 版本也存在此限度
Enjoy GreatSQL :)
## 对于 GreatSQL
GreatSQL 是实用于金融级利用的国内自主开源数据库,具备高性能、高牢靠、高易用性、高平安等多个外围个性,能够作为 MySQL 或 Percona Server 的可选替换,用于线上生产环境,且完全免费并兼容 MySQL 或 Percona Server。
相干链接:GreatSQL 社区 Gitee GitHub Bilibili
GreatSQL 社区:
社区博客有奖征稿详情:https://greatsql.cn/thread-100-1-1.html
技术交换群:
微信:扫码增加
GreatSQL 社区助手
微信好友,发送验证信息加群
。