作者:付祥
现居珠海,次要负责 Oracle、MySQL、mongoDB 和 Redis 保护工作。
本文起源:原创投稿
*爱可生开源社区出品,原创内容未经受权不得随便应用,转载请分割小编并注明起源。
1. 环境阐明
1.1. MySQL版本
root@3306 (none)> SELECT @@VERSION;+------------+| @@VERSION |+------------+| 5.7.34-log |+------------+1 row in set (0.01 sec)
1.2. binlog 相干参数
root@3306 (none)> select @@log_bin,@@log_slave_updates;+-----------+---------------------+| @@log_bin | @@log_slave_updates |+-----------+---------------------+| 1 | 1 |+-----------+---------------------+1 row in set (0.00 sec)root@3306 (none)>
1.3. GTID 相干参数
root@3306 (none)> select@@binlog_gtid_simple_recovery,@@enforce_gtid_consistency,@@gtid_mode;+-------------------------------+----------------------------+-------------+| @@binlog_gtid_simple_recovery | @@enforce_gtid_consistency | @@gtid_mode |+-------------------------------+----------------------------+-------------+| 1 | ON | ON |+-------------------------------+----------------------------+-------------+1 row in set (0.01 sec)root@3306 (none)>
1.4. 半同步相干参数
root@3306 (none)> show global variables like '%semi%';+-------------------------------------------+------------+| Variable_name | Value |+-------------------------------------------+------------+| rpl_semi_sync_master_enabled | ON || rpl_semi_sync_master_timeout | 1000 || rpl_semi_sync_master_trace_level | 32 || rpl_semi_sync_master_wait_for_slave_count | 1 || rpl_semi_sync_master_wait_no_slave | ON || rpl_semi_sync_master_wait_point | AFTER_SYNC || rpl_semi_sync_slave_enabled | ON || rpl_semi_sync_slave_trace_level | 32 |+-------------------------------------------+------------+8 rows in set (0.00 sec)root@3306 (none)>
1.5. 多线程同步相干参数
root@3306 (none)> select@@binlog_transaction_dependency_tracking,@@slave_parallel_type,@@slave_parallel_workers;+------------------------------------------+-----------------------+--------------------------+| @@binlog_transaction_dependency_tracking | @@slave_parallel_type |@@slave_parallel_workers |+------------------------------------------+-----------------------+--------------------------+| COMMIT_ORDER | LOGICAL_CLOCK |2 |+------------------------------------------+-----------------------+--------------------------+1 row in set (0.00 sec)root@3306 (none)>
2.故障景象
MySQL 从库所在主机故障重启后,sql_thread 线程报错:
root@3306 (none)> show slave status\G-- 摘取有用信息如下:Slave_IO_Running: YesSlave_SQL_Running: NoReplicate_Do_DB:Replicate_Ignore_DB:Replicate_Do_Table:Replicate_Ignore_Table:Replicate_Wild_Do_Table:Replicate_Wild_Ignore_Table:Last_Errno: 1062Last_Error: Coordinator stopped because there wereerror(s) in the worker(s). The most recent failure being: Worker 1 failedexecuting transaction '471c2974-f9bb-11eb-afb1-52540010fb89:88313207' at masterlog MySql-bin.000685, end_log_pos 1004756557. See error log and/orperformance_schema.replication_applier_status_by_worker table for more detailsabout this failure or others, if any.......省略..................Retrieved_Gtid_Set: 471c2974-f9bb-11eb-afb1-52540010fb89:88313207-88341912Executed_Gtid_Set: 471c2974-f9bb-11eb-afb1-52540010fb89:1-88313206,d4c228df-f9c6-11eb-a2d8-5254006f63b6:1-5Auto_Position: 1root@3306 (none)> select * fromperformance_schema.replication_applier_status_by_worker\G*************************** 1. row ***************************CHANNEL_NAME:WORKER_ID: 1THREAD_ID: NULLSERVICE_STATE: OFFLAST_SEEN_TRANSACTION: 471c2974-f9bb-11eb-afb1-52540010fb89:88313207LAST_ERROR_NUMBER: 1062LAST_ERROR_MESSAGE: Worker 1 failed executing transaction '471c2974-f9bb-11eb-afb1-52540010fb89:88313207' at master log MySql-bin.000685, end_log_pos1004756557; Could not execute Write_rows event on table kefumobile.i_sms_proxy;Duplicate entry '14765130' for key 'PRIMARY', Error_code: 1062; handler errorHA_ERR_FOUND_DUPP_KEY; the event's master log FIRST, end_log_pos 1004756557LAST_ERROR_TIMESTAMP: 2022-01-24 23:05:02*************************** 2. row ***************************CHANNEL_NAME:WORKER_ID: 2THREAD_ID: NULLSERVICE_STATE: OFFLAST_SEEN_TRANSACTION:LAST_ERROR_NUMBER: 0LAST_ERROR_MESSAGE:LAST_ERROR_TIMESTAMP: 0000-00-00 00:00:002 rows in set (0.00 sec)
通过报错信息可知,worker 线程在回放事务 '471c2974-f9bb-11eb-afb1-52540010fb89:88313207'
时,因为要插入的记录主键抵触报错。
3.故障剖析
主机重启前,主从同步失常,主机重启后,主从同步因为主键抵触报错,比照了抵触主键所在行记
录在主从库是统一的,初步剖析事务 '471c2974-f9bb-11eb-afb1-52540010fb89:88313207'
在主机故
障前曾经在从库进行了回放,那为何事务会反复回放呢?
在开启gtid模式下,如果指定 master_auto_position=1,start slave 时,从库会把
Retrieved_Gtid_Set 和 Executed_Gtid_Set 的并集发送给主库,主库将收到的并集和本人的
gtid_executed 比拟,把从库 gtid 汇合里缺失的事务全都发送给从库。
主机重启后,事务反复回放,表明 Retrieved_Gtid_Set 和 Executed_Gtid_Set 的并集中有 GTID 事务
失落,导致反复获取事务执行引发主键抵触谬误。Retrieved_Gtid_Set 和 Executed_Gtid_Set 均为内存变
量,MySQL 重启后,Retrieved_Gtid_Set 初始化为空值,从而推断出 Executed_Gtid_Set 有 GTID 事务丢
失。
Executed_Gtid_Set 来源于 gtid_executed 变量,gtid_executed 变量长久化介质有
mysql.gtid_executed 表和 binlog ,其中 mysql.gtid_executed 表是 MySQL 5.7 后引入的,在 MySQL 5.6 中,从库要应用 GTID ,必须要先设置 log_bin=on,log_slave_updates=on ,因为从库执行过的 GTID 只保留在 binlog 中。
- 当 log_bin=on ,log_slave_updates=off 时,gtid_executed 变量的更新实时长久化到 mysql.gtid_executed 表中,MySQL 重启后 gtid_executed 变量值从 mysql.gtid_executed 表读取。
- 当 log_bin=on ,log_slave_updates=on 时,只有在 binlog 切换时侯才会更新 mysql.gtid_executed 表,保留直到上一个 binlog 执行过的 GTID 汇合。mysql 重启后,在默认参数 binlog_gtid_simple_recovery=1 时,gtid_executed 变量值从最初一个 binlog 文件计算取得。
gtid_executed 变量值古老,推断出 binlog 未实时长久化,咱们看一下参数 sync_binlog :
root@3306 (none)> show variables like 'sync_binlog';+---------------+-------+| Variable_name | Value |+---------------+-------+| sync_binlog | 600 |+---------------+-------+1 row in set (0.00 sec)
通过以上剖析,此次故障前因后果就分明了:
Worker 线程报 1062 主键抵触谬误 --> gtid_executed 信息古老 --> binlog 未实时长久化
4.测试验证
搭建一主一从测试环境,通过 sysbench 模仿主库并发插入,从库主机暴力关机后,故障复现:
root@mysql.sock][(none)]> select * fromperformance_schema.replication_applier_status_by_worker\G*************************** 1. row ***************************CHANNEL_NAME:WORKER_ID: 1THREAD_ID: NULLSERVICE_STATE: OFFLAST_SEEN_TRANSACTION: 4a0ad3da-b89e-11eb-9d0b-000c299b4d6c:452362LAST_ERROR_NUMBER: 1062LAST_ERROR_MESSAGE: Worker 1 failed executing transaction '4a0ad3da-b89e-11eb-9d0b-000c299b4d6c:452362' at master log MySql-bin.000012, end_log_pos1011339749; Could not execute Write_rows event on table sbtest.sbtest5;Duplicate entry '111305' for key 'PRIMARY', Error_code: 1062; handler errorHA_ERR_FOUND_DUPP_KEY; the event's master log FIRST, end_log_pos 1011339749LAST_ERROR_TIMESTAMP: 2022-01-26 09:56:38*************************** 2. row ***************************CHANNEL_NAME:WORKER_ID: 2THREAD_ID: NULLSERVICE_STATE: OFFLAST_SEEN_TRANSACTION:LAST_ERROR_NUMBER: 0LAST_ERROR_MESSAGE:LAST_ERROR_TIMESTAMP: 0000-00-00 00:00:002 rows in set (0.00 sec)[root@mysql.sock][(none)]>
5.故障解决
既然谬误起因是事务反复执行,那跳过谬误就好了,有如下两种形式,依据须要选取其中一种形式执行:
5.1.通过空事务代替报错事务执行
set gtid_next='xxxxxx';begin;commit;set gtid_next=AUTOMATIC;start slave sql_thread
如果最新 binglog 失落的 GTID 较多,手工执行比拟繁琐,须要一直试错。可写一个存储过程批量执行:
set sql_log_bin=0;DELIMITER $$create procedure p_fx_gtid_next(i_master_Executed_Gtid_max varchar(100))begindeclare v_gtid_next varchar(100);declare master_Executed_Gtid_max varchar(100);declare slave_Executed_Gtid_max varchar(100);-- 主库以后执行了的gtid最大值,做为退出循环条件 show master statusset master_Executed_Gtid_max=i_master_Executed_Gtid_max;loop_name:loopSELECT ifnull(min(LAST_SEEN_TRANSACTION),'empty') FROMperformance_schema.replication_applier_status_by_worker WHERELAST_ERROR_NUMBER=1062into v_gtid_next;if v_gtid_next <> 'empty' thenset gtid_next = v_gtid_next;start transaction;commit;set gtid_next =AUTOMATIC;start slave sql_thread;end if;select max(LAST_SEEN_TRANSACTION) fromperformance_schema.replication_applier_status_by_worker intoslave_Executed_Gtid_max;if slave_Executed_Gtid_max = master_Executed_Gtid_max thenleave loop_name;end if;select SLEEP(1);end loop;end $$DELIMITER ;set sql_log_bin=1;call p_fx_gtid_next('XXXXX:XXX');
5.2.带参数 slave_skip_errors=1062 重启 MySQL
待主从同步失常后,再勾销参数 slave_skip_errors 设置重启 MySQL 。