一个线上数据失落故障案例,引出了在 GTID 模式下 AUTO POSITION MODE 的必要性。
作者:孙绪宗
新浪微博 DBA 团队工程师,次要负责 MySQL、PostgreSQL 等关系型数据库运维。
本文起源:原创投稿
* 爱可生开源社区出品,原创内容未经受权不得随便应用,转载请分割小编并注明起源。
浏览前需知
- Retrieved_Gtid_Set:从库曾经接管到主库的事务编号(从库的 IO 线程曾经承受到了)。
- Executed_Gtid_Set:曾经执行的事务编号(从库的执行 SQL 线程曾经执行了的 SQL)。
故障景象
主从数据不统一,然而看复制是失常状态(双 Yes)。此时主库执行,从库本该报错 1062 或者 1032 的 SQL,从库复制线程还是双 Yes,没有报错。
MySQL 版本:5.7.35
故障复现
查看主库状态
MySQL [xuzong]> show master status;
+------------------+----------+--------------+------------------+-----------------------------------------------+
| File | Position | Binlog_Do_DB | Binlog_Ignore_DB | Executed_Gtid_Set |
+------------------+----------+--------------+------------------+-----------------------------------------------+
| mysql-bin.000008 | 39349641 | | | c233aec0-58d3-11ec-a74a-a0a33ba8b3ff:1-104345 |
+------------------+----------+--------------+------------------+-----------------------------------------------+
从库 上只须要设置 GTID 值的 gno 大于主库的值即可复现。
mysql> stop slave;
Query OK, 0 rows affected (0.00 sec)
mysql> reset slave all;
Query OK, 0 rows affected (0.01 sec)
mysql> reset master;
Query OK, 0 rows affected (0.01 sec)
mysql> set @@GLOBAL.GTID_PURGED='c233aec0-58d3-11ec-a74a-a0a33ba8b3ff:1-1000000'; #1000000>104345
Query OK, 0 rows affected (0.00 sec)
而后用 POSITION MODE 点位复制,启动从库,执行 show slave status \G
就会看到如下:
Slave_IO_Running: Yes
Slave_SQL_Running: Yes
......
Last_IO_Error_Timestamp:
Last_SQL_Error_Timestamp:
Master_SSL_Crl:
Master_SSL_Crlpath:
Retrieved_Gtid_Set: c233aec0-58d3-11ec-a74a-a0a33ba8b3ff:104221-104345
Executed_Gtid_Set: c233aec0-58d3-11ec-a74a-a0a33ba8b3ff:1-1000000
Auto_Position: 0
Replicate_Rewrite_DB:
双 Yes,看上去一点问题都没有,然而这时候在主库执行任何操作,都不会被复制,因为复制时校验 GTID 时从库示意曾经执行过了。
主库
MySQL [xuzong]> insert into test(passtext) values('test');
Query OK, 1 row affected (0.00 sec)
从库
mysql> select * from xuzong.test;
+----+----------------------------------+
| id | passtext |
+----+----------------------------------+
| 1 | 7e5a44af63552be3f2f819cebbe0832a |
| 2 | 7e5a44af63552be3f2f819cebbe0832a |
| 3 | 7e5a44af63552be3f2f819cebbe0832a |
| 4 | d13baf7019431e0c75dee85bc923a91b |
| 5 | 11 |
+----+----------------------------------+
5 rows in set (0.00 sec)
# 从库能看进去 Retrieved_Gtid_Set 在变动
mysql> show slave status \G
Retrieved_Gtid_Set: c233aec0-58d3-11ec-a74a-a0a33ba8b3ff:104221-104367
Executed_Gtid_Set: c233aec0-58d3-11ec-a74a-a0a33ba8b3ff:1-1000000
问题在于,这种不应该报错嘛,从库的 GTID 大于主库的 GTID???
猜想
本源在于没有设置 Auto_Position=1
,那么接下来验证一下猜想。
Last_IO_Error: Got fatal error 1236 from master when reading data from binary log: 'Slave has more GTIDs than the master has, using the master's SERVER_UUID. This may indicate that the end of the binary log was truncated or that the last binary log file was lost, e.g., after a power or disk failure when sync_binlog != 1. The master may or may not have rolled back transactions that were already replicated to the slave. Suggest to replicate any transactions that master has rolled back from slave to master, and/or commit empty transactions on master to account for transactions that have been'
Last_SQL_Errno: 0
Last_SQL_Error:
Replicate_Ignore_Server_Ids:
Master_Server_Id: 1362229455
Master_UUID: c233aec0-58d3-11ec-a74a-a0a33ba8b3ff
Master_Info_File: mysql.slave_master_info
SQL_Delay: 0
SQL_Remaining_Delay: NULL
Slave_SQL_Running_State: Slave has read all relay log; waiting for more updates
Master_Retry_Count: 86400
Master_Bind:
Last_IO_Error_Timestamp: 230817 11:11:44
Last_SQL_Error_Timestamp:
Master_SSL_Crl:
Master_SSL_Crlpath:
Retrieved_Gtid_Set:
Executed_Gtid_Set: c233aec0-58d3-11ec-a74a-a0a33ba8b3ff:1-1000000
Auto_Position: 1
果不其然报错了,验证了猜想。
问题解决
这种状况目前看只能重做从库。
因为案例中复制过程为双 Yes,无奈被监控捕捉,同时无奈得悉数据不统一呈现的工夫点,所以无奈通过剖析 binlog 复原。
本案例中的实例还进行过主从切换,只能联系业务方做一次全量的数据比照,失去一个残缺数据的实例,而后从新构建从库。
总结
最初总结一下 Auto_Position
的作用。
- 会立刻清理原来的 Relay Log。
- 依据从库的
Executed_Gtid_Set
和Retrieved_Gtid_Set
的并集定位 Binlog。MASTER_LOG_FILE
和MASTER_LOG_POS
不存实例的值。
生产上实例开了 GTID 的话,主从复制倡议用 AUTO_POSITION MODE 的形式,即 Auto_Position=1
。防止一些未知操作导致从库复制线程没有预期的报错,进而导致从库未正确复制,数据失落。
拓展:源码解读
rel_slave.cc
--> request_dump
局部源码:.....
enum_server_command command= mi->is_auto_position() ?
COM_BINLOG_DUMP_GTID : COM_BINLOG_DUMP;
......
if (command == COM_BINLOG_DUMP_GTID)
{if (gtid_executed.add_gtid_set(mi->rli->get_gtid_set()) != RETURN_STATUS_OK || // 加 Retrieved_Gtid_Set
gtid_executed.add_gtid_set(gtid_state->get_executed_gtids()) !=
RETURN_STATUS_OK) // 加 Executed_Gtid_Set
......
int2store(ptr_buffer, binlog_flags);
ptr_buffer+= ::BINLOG_FLAGS_INFO_SIZE;
int4store(ptr_buffer, server_id);
ptr_buffer+= ::BINLOG_SERVER_ID_INFO_SIZE;
int4store(ptr_buffer, static_cast<uint32>(BINLOG_NAME_INFO_SIZE));
ptr_buffer+= ::BINLOG_NAME_SIZE_INFO_SIZE;
memset(ptr_buffer, 0, BINLOG_NAME_INFO_SIZE); // 设置 MASTER_LOG_FILE 为 0
ptr_buffer+= BINLOG_NAME_INFO_SIZE;
int8store(ptr_buffer, 4LL); // 设置 MASTER_LOG_POS 为 4
ptr_buffer+= ::BINLOG_POS_INFO_SIZE; // 存 gtid_set
如果 relay_log_recovery
是关上的,则疏忽 Retrieved_Gtid_Set
值。
/*
In the init_gtid_set below we pass the mi->transaction_parser.
This will be useful to ensure that we only add a GTID to
the Retrieved_Gtid_Set for fully retrieved transactions. Also, it will
be useful to ensure the Retrieved_Gtid_Set behavior when auto
positioning is disabled (we could have transactions spanning multiple
relay log files in this case).
We will skip this initialization if relay_log_recovery is set in order
to save time, as neither the GTIDs nor the transaction_parser state
would be useful when the relay log will be cleaned up later when calling
init_recovery.
*/ // 正文解释的很分明,relay_log_recovery=1 会跳过初始化 gtid_retrieved_initialized
if (!is_relay_log_recovery &&
!gtid_retrieved_initialized &&
relay_log.init_gtid_sets(>id_set, NULL,
opt_slave_sql_verify_checksum,
true/*true=need lock*/,
&mi->transaction_parser, >id_partial_trx))
{sql_print_error("Failed in init_gtid_sets() called from Relay_log_info::rli_init_info().");
DBUG_RETURN(1);
}
gtid_retrieved_initialized= true;
#ifndef NDEBUG
global_sid_lock->wrlock();
gtid_set.dbug_print("set of GTIDs in relay log after initialization");
global_sid_lock->unlock();
#endif