关于备份:故障分析-血的教训由慢查询引发的备份等待导致数据库连接打满

作者：马文斌
MySQL爱好者,任职于蓝月亮(中国)有限公司。
本文起源：原创投稿
*爱可生开源社区出品，原创内容未经受权不得随便应用，转载请分割小编并注明起源。

背景

早上一大早被拉去开早会，感觉事件不妙，得悉是某中台(发券)数据库不能失常提供拜访。呈现大量的上面状况：

| 55279262 | xxx_app | 192.168.50.143:44623 | xxx_mam | Query | 206 | Waiting for table flush | SELECT count(*) FROM mam_user_coupon_package AS a LEFT JOIN mam_coupon_apply_scope as d on a.coupon_info_id = d.coupon_info_id WHERE 1=1 AND a.usable_flag = 1 AND a.user_id = 102101100050614656 and a.coupon_status in (1) and d.business_code = 30006 AND a.end_time >= '2022-04-25 08:56:49.683'                                                                                                                  || 55278572 | xxx_app | 192.168.50.144:40013 | xxx_mam | Query | 315 | Waiting for table flush | SELECT coupon_user_code,last_update_time,operator_name,operator_id,create_time,creator_name,creator_id,del_flag,user_id,coupon_status,coupon_info_id,start_time,end_time,use_time,activate_time,activate_end_time,coupon_type,coupon_info_type,act_id,usable_flag FROM mam_user_coupon_package WHERE ( ( end_time <= '2022-04-25 08:55:00.638' and usable_flag = 1 and del_flag = 0 and coupon_status = 1 ) ) limit 0,1000| 55165820 | root | localhost | NULL | Query | 28956 | Waiting for table flush | FLUSH NO_WRITE_TO_BINLOG TABLES |

字面意思是期待表刷新，然而这只是个查问，为啥要刷新表呢，而且是几百上千个全局连贯进来，一时半会找不到解决方案，只有先救火后剖析，那就先把查问干掉再说

select concat('kill ', id , ';') from PROCESSLIST where USER='xxx_app' and STATE='Waiting for table flush' ;

前面发现不行啊，早上一直有连贯申请连贯进来，这是指标不治标，没方法只能用重启大法，重启完之后数据库终于又复原了来日的静宁了。

预先剖析：

有SQL 与备份操作抵触导致卡死呈现。

220425 01:01:48 [01] Streaming ./xx_ums/ums_channal_authorized.ibd220425 01:01:48 [01] ...done220425 01:01:48 >> log scanned up to (425114657787)220425 01:01:48 Executing FLUSH NO_WRITE_TO_BINLOG TABLES...220425 01:01:49 >> log scanned up to (425114657787)……220425 09:05:52 >> log scanned up to (425114692507)Error: failed to execute query 'FLUSH NO_WRITE_TO_BINLOG TABLES': 2013 (HY000) Lost connection to MySQL server during query

这里是局部备份时产生的日志文件，要害信息就在这里，flush table 操作的不记录到 binlog 里，防止从库也执行同样的操作。

于是带着疑难，1、要么有长事务没有提交，2、要么有大的慢查问阻塞

备份原理:

备份流程图大抵如下：

1、依据链接信息做数据库的连通性和版本检测2、读取数据库相干的配置信息（数据和日志文件地位等信息）3、开启redo日志拷贝线程，从最新的checkpoint开始程序拷贝redolog到xtrabackup_logfile中4、开始备份 .ibd ibdata1 undo 等innodb文件5、执行FLUSH NO_WRITE_TO_BINLOG TABLES 和  FLUSH TABLES WITH READ LOCK    1）敞开所有关上的表，强制敞开所有正在应用的表,不写入binlog    2）敞开所有关上的表，强制敞开所有正在应用的表，并给所有表减少全局read lock6、拷贝非innodb表相干文件信息7、读取binlog pos信息并写入xtrabackup_binlog_info文件8、执行 FLUSH NO_WRITE_TO_BINLOG ENGINE LOGS    1)刷新redo log buffer中的日志到磁盘文件，确保redo log copy线程拷贝这最初的redolog日志数据（为什么说是最初的    redolog日志数据，因为在FLUSH TABLES WITH READ LOCK加锁之后，应用UNLOCK TABLES开释全局读锁之前，不会再有新    的申请进来）    2）为什么确保所有redo日志拷贝残缺（因为没有拷贝binlog，如果最初一组提交的事物redo的commit标记没有落盘复原时    会失落最初一组事物)9、进行redolog日志拷贝线程10、执行UNLOCK TABLES开释表锁11、收尾生成backup-my.cnf和xtrabackup_info等文件，备份实现。

能够看到咱们备份的时候是卡在这一步：

FLUSH NO_WRITE_TO_BINLOG TABLES

1）敞开所有关上的表，强制敞开所有正在应用的表,不写入binlog

ok，十分清晰，咱们从2个方面动手，写 | 读

场景1:模仿未提交的写事务是否阻塞备份

序号-工夫线	trx1	backup
1	begin;
2	update tb1 set id =10 where id =1;
3		/usr/bin/innobackupex --defaults-file=/etc/my.cnf --user=bigdata_admin --password=123456 --socket=/mysqlData/data/mysql.sock --port=3306 /data/backup/
4		220425 15:13:09 completed OK!

备份输入的日志：

220425 15:13:08 [01] Copying ./db1/db.opt to /data/backup//2022-04-25_15-12-25/db1/db.opt220425 15:13:08 [01]        ...done220425 15:13:08 [01] Copying ./db1/tb1.frm to /data/backup//2022-04-25_15-12-25/db1/tb1.frm220425 15:13:08 [01]        ...done220425 15:13:08 Finished backing up non-InnoDB tables and files220425 15:13:08 [00] Writing xtrabackup_binlog_info220425 15:13:08 [00]        ...done220425 15:13:08 Executing FLUSH NO_WRITE_TO_BINLOG ENGINE LOGS...xtrabackup: The latest check point (for incremental): '2447317'xtrabackup: Stopping log copying thread..220425 15:13:08 >> log scanned up to (2447326)220425 15:13:08 Executing UNLOCK TABLES220425 15:13:08 All tables unlocked220425 15:13:08 [00] Copying ib_buffer_pool to /data/backup//2022-04-25_15-12-25/ib_buffer_pool220425 15:13:08 [00]        ...done220425 15:13:08 Backup created in directory '/data/backup//2022-04-25_15-12-25'MySQL binlog position: filename 'mysql-bin.000003', position '1032', GTID of the last change '74fd0ba9-c45b-11ec-97a4-525400abe2c0:1-8'220425 15:13:08 [00] Writing backup-my.cnf220425 15:13:08 [00]        ...done220425 15:13:08 [00] Writing xtrabackup_info220425 15:13:08 [00]        ...donextrabackup: Transaction log of lsn (2447317) to (2447326) was copied.220425 15:13:09 completed OK!

论断：这个没提交的写事务是不会阻塞备份

场景2:模仿事务中的慢查问是否阻塞备份

序号-工夫线	trx1	backup
1	begin;
2	select id,sleep(1000) from tb1;
3		/usr/bin/innobackupex --defaults-file=/etc/my.cnf --user=bigdata_admin --password=123456 --socket=/mysqlData/data/mysql.sock --port=3306 /data/backup/
4		220425 15:20:58 [01] Copying ./db1/tb1.ibd to /data/backup//2022-04-25_15-20-21/db1/tb1.ibd<br/>220425 15:20:58 [01] ...done<br/>220425 15:20:58 >> log scanned up to (2447326)<br/>220425 15:20:59 Executing FLUSH NO_WRITE_TO_BINLOG TABLES...<br/>220425 15:20:59 >> log scanned up to (2447326)<br/>220425 15:21:00 >> log scanned up to (2447326)<br/>220425 15:21:01 >> log scanned up to (2447326)<br/>220425 15:21:02 >> log scanned up to (2447326)<br/>220425 15:21:03 >> log scanned up to (2447326)<br/>220425 15:21:04 >> log scanned up to (2447326)<br/>220425 15:21:05 >> log scanned up to (2447326)<br/>220425 15:21:06 >> log scanned up to (2447326)<br/>220425 15:21:07 >> log scanned up to (2447326)<br/>220425 15:21:08 >> log scanned up to (2447326)<br/>220425 15:21:09 >> log scanned up to (2447326)<br/>220425 15:21:10 >> log scanned up to (2447326)<br/>220425 15:21:11 >> log scanned up to (2447326)<br/>220425 15:21:12 >> log scanned up to (2447326)<br/>220425 15:21:13 >> log scanned up to (2447326)<br/>220425 15:21:14 >> log scanned up to (2447326)<br/>220425 15:21:15 >> log scanned up to (2447326)
5	localhost 15:12:19 [db1]> select id,sleep(1000) from tb1;<br/>ERROR 3024 (HY000): Query execution was interrupted, maximum statement execution time exceeded
6		220425 15:22:18 [00] Writing xtrabackup_binlog_info<br/>220425 15:22:18 [00] ...done<br/>220425 15:22:18 Executing FLUSH NO_WRITE_TO_BINLOG ENGINE LOGS...<br/>xtrabackup: The latest check point (for incremental): '2447317'<br/>xtrabackup: Stopping log copying thread.<br/>.220425 15:22:18 >> log scanned up to (2447326)<br/><br/>220425 15:22:18 Executing UNLOCK TABLES<br/>220425 15:22:18 All tables unlocked<br/>220425 15:22:18 [00] Copying ib_buffer_pool to /data/backup//2022-04-25_15-20-21/ib_buffer_pool<br/>220425 15:22:18 [00] ...done<br/>220425 15:22:18 Backup created in directory '/data/backup//2022-04-25_15-20-21'<br/>MySQL binlog position: filename 'mysql-bin.000003', position '1032', GTID of the last change '74fd0ba9-c45b-11ec-97a4-525400abe2c0:1-8'<br/>220425 15:22:18 [00] Writing backup-my.cnf<br/>220425 15:22:18 [00] ...done<br/>220425 15:22:18 [00] Writing xtrabackup_info<br/>220425 15:22:18 [00] ...done<br/>xtrabackup: Transaction log of lsn (2447317) to (2447326) was copied.<br/>220425 15:22:18 completed OK!

咱们能够看到，直到sql执行超时，也就是意味着表敞开了，备份才胜利。

论断：事务中的慢查问会阻塞备份

场景3：模仿不带事务慢查问是否阻塞备份

序号-工夫线	trx-1	backup
1	select id,sleep(1000) from tb1;
2		/usr/bin/innobackupex --defaults-file=/etc/my.cnf --user=bigdata_admin --password=123456 --socket=/mysqlData/data/mysql.sock --port=3306 /data/backup/
3		220425 15:26:03 [01] ...done<br/>220425 15:26:03 [01] Copying ./db1/tb1.ibd to /data/backup//2022-04-25_15-25-27/db1/tb1.ibd<br/>220425 15:26:03 [01] ...done<br/>220425 15:26:03 Executing FLUSH NO_WRITE_TO_BINLOG TABLES...<br/>220425 15:26:04 >> log scanned up to (2447326)<br/>220425 15:26:05 >> log scanned up to (2447326)<br/>220425 15:26:06 >> log scanned up to (2447326)<br/>220425 15:26:07 >> log scanned up to (2447326)<br/>220425 15:26:08 >> log scanned up to (2447326)<br/>220425 15:26:09 >> log scanned up to (2447326)
4	localhost 15:22:17 [db1]> select id,sleep(1000) from tb1;<br/>ERROR 3024 (HY000): Query execution was interrupted, maximum statement execution time exceeded
5		220425 15:27:23 [00] ...done<br/>220425 15:27:23 Executing FLUSH NO_WRITE_TO_BINLOG ENGINE LOGS...<br/>xtrabackup: The latest check point (for incremental): '2447317'<br/>xtrabackup: Stopping log copying thread.<br/>.220425 15:27:23 >> log scanned up to (2447326)<br/><br/>220425 15:27:23 Executing UNLOCK TABLES<br/>220425 15:27:23 All tables unlocked<br/>220425 15:27:23 [00] Copying ib_buffer_pool to /data/backup//2022-04-25_15-25-27/ib_buffer_pool<br/>220425 15:27:23 [00] ...done<br/>220425 15:27:23 Backup created in directory '/data/backup//2022-04-25_15-25-27'<br/>MySQL binlog position: filename 'mysql-bin.000003', position '1032', GTID of the last change '74fd0ba9-c45b-11ec-97a4-525400abe2c0:1-8'<br/>220425 15:27:23 [00] Writing backup-my.cnf<br/>220425 15:27:23 [00] ...done<br/>220425 15:27:23 [00] Writing xtrabackup_info<br/>220425 15:27:23 [00] ...done<br/>xtrabackup: Transaction log of lsn (2447317) to (2447326) was copied.<br/>220425 15:27:23 completed OK!
6

咱们能够看到，直到sql执行超时，也就是意味着表敞开了，备份拿到了备份锁，备份才胜利。对于备份锁的材料请查看这里:https://www.percona.com/doc/p...

论断：不带事务的慢查问也会阻塞备份

FLUSH TABLES

FLUSH NO_WRITE_TO_BINLOG TABLES = FLUSH TABLES ，只不过这种flush tables 的动作是不会写入到binlog文件中，而flush tables 是须要获取锁。官网文件这样形容flush tables

The FLUSH statement has several variant forms that clear or reload various internal caches, flush tables, or acquire locks.By default, the server writes FLUSH statements to the binary log so that they replicate to replicas. To suppress logging, specify the optional NO_WRITE_TO_BINLOG keyword or its alias LOCAL.FLUSH TABLES flushes tables, and, depending on the variant used, acquires locks. Any TABLES variant used in a FLUSH statement must be the only option used. FLUSH TABLE is a synonym for FLUSH TABLES.Closes all open tables, forces all tables in use to be closed, and flushes the query cache and prepared statement cache. FLUSH TABLES also removes all query results from the query cache, like the RESET QUERY CACHE statement. For information about query caching and prepared statement caching

大略意思就是：敞开所有关上的表，强制敞开所有正在应用的表，并刷新查问缓存和筹备好的语句缓存。FLUSH TABLES还会从查问缓存中删除所有查问后果，例如 RESET QUERY CACHE语句。无关查问缓存和筹备好的语句缓存的信息

反推生产环境

查看慢查问

Oh my god，在32.197服务器的慢查问中搜寻到一个执行了10几个小时的慢查问

# Time: 2022-04-25T09:12:53.416049+08:00# User@Host: BlueMoon[BlueMoon] @  [192.168.50.144]  Id: 55128898# Query_time: 38910.325000  Lock_time: 0.000328 Rows_sent: 0  Rows_examined: 14637241SET timestamp=1650849173;SELECT        d.coupon_master_codeFROM        mam_coupon_activity_interact a        LEFT JOIN mam_coupon_send_log b ON a.id = b.interact_id        LEFT JOIN mam_user_coupon_package c ON c.coupon_user_code = b.coupon_user_code        left join        mam_coupon_info  d on c.coupon_info_id = d.idWHERE        a.interact_type = 3GROUP BY        d.coupon_master_code;

38910秒=10.8 小时，快11个钟了， Time: 2022-04-25T09:12:53 ，往前推38910秒，也就是 2022-04-24 22:24:23 开始执行这个sql。

为什么会这么慢，咱们看下执行计

表构造如下：

mysql> show create table mam_coupon_send_log \G*************************** 1. row ***************************       Table: mam_coupon_send_logCreate Table: CREATE TABLE `mam_coupon_send_log` (  `id` int(10) NOT NULL AUTO_INCREMENT COMMENT '主键',  `last_update_time` datetime DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP COMMENT '最初更新工夫',  `operator_name` varchar(32) COLLATE utf8mb4_unicode_ci DEFAULT NULL COMMENT '操作人名称',  `operator_id` int(11) DEFAULT NULL COMMENT '操作人编号',  `create_time` datetime DEFAULT CURRENT_TIMESTAMP COMMENT '创立工夫',  `creator_name` varchar(32) COLLATE utf8mb4_unicode_ci DEFAULT NULL COMMENT '创建人名称',  `creator_id` int(11) DEFAULT NULL COMMENT '创建人编号',  `del_flag` tinyint(1) DEFAULT '0' COMMENT '是否删除（0：否，1：是）',  `shop_order_id` bigint(20) DEFAULT NULL COMMENT '订单编号',  `activity_code` bigint(20) DEFAULT NULL COMMENT '流动编号',  `coupon_user_code` bigint(20) DEFAULT NULL COMMENT '用户券编号',  `user_id` bigint(20) DEFAULT NULL COMMENT '领券人id',  `mensend_id` int(11) DEFAULT NULL COMMENT '人工发券id',  `interact_id` bigint(20) DEFAULT NULL COMMENT '互动赠券id',  `send_type` tinyint(1) DEFAULT NULL COMMENT '发券类型(1人工发券2积分兑券3买单赠券4数据迁徙5券转增6券转换7营销互动)',  `act_id` int(11) DEFAULT NULL COMMENT '流动id',  `integral_record_id` bigint(20) DEFAULT NULL COMMENT '积分兑换记录表id',  `interact_detail_id` bigint(20) DEFAULT NULL COMMENT '互动赠券明细id',  PRIMARY KEY (`id`) USING BTREE,  KEY `idx_coupon_use_code` (`coupon_user_code`) USING BTREE COMMENT '卡券包编码索引',  KEY `idx_shop_order_id_send_type` (`shop_order_id`,`send_type`) USING BTREE) ENGINE=InnoDB AUTO_INCREMENT=551413 DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci ROW_FORMAT=DYNAMIC COMMENT='用户券发送日志表'

倡议对interact_id 字段加索引

alter table mam_coupon_send_log add index idx_interact_id(interact_id);

而咱们那点备份是从早晨1点开始备份，也就是说这个备份始终在期待这个查问完结,合乎场景3:

查看数据库的sql超时是否设置

mysql> show variables like '%exec%';+----------------------------------+--------+| Variable_name                    | Value  |+----------------------------------+--------+| gtid_executed_compression_period | 1000   || max_execution_time               | 0      || rbr_exec_mode                    | STRICT || slave_exec_mode                  | STRICT |+----------------------------------+--------+

。。。。。竟然没有设置sql查问超时工夫

倡议设置sql超时工夫

set global max_execution_time = 120000;120秒超时

反推备份优化

设置超时工夫

https://www.percona.com/doc/p...

--ftwrl-wait-timeout=SECONDSThis option specifies time in seconds that innobackupex should wait for queries that would block FLUSH TABLES WITH READ LOCK before running it. If there are still such queries when the timeout expires, innobackupex terminates with an error. Default is 0, in which case innobackupex does not wait for queries to complete and starts FLUSH TABLES WITH READ LOCK immediately. Where supported (Percona Server 5.6+) xtrabackup will automatically use Backup Locks as a lightweight alternative to FLUSH TABLES WITH READ LOCK to copy non-InnoDB data to avoid blocking DML queries that modify InnoDB tables.--kill-long-queries-timeout=SECONDSThis option specifies the number of seconds innobackupex waits between starting FLUSH TABLES WITH READ LOCK and killing those queries that block it. Default is 0 seconds, which means innobackupex will not attempt to kill any queries. In order to use this option xtrabackup user should have PROCESS and SUPER privileges. Where supported (Percona Server 5.6+) xtrabackup will automatically use Backup Locks as a lightweight alternative to FLUSH TABLES WITH READ LOCK to copy non-InnoDB data to avoid blocking DML queries that modify InnoDB tables.

/usr/bin/innobackupex --defaults-file=/etc/my.cnf --user=bigdata_admin --password=123456 --socket=/mysqlData/data/mysql.sock --port=3306 --ftwrl-wait-timeout=5 --kill-long-queries-timeout=5 /data/backup/

总结

1.业务层面：须要定期优化慢查问，业务库的增删改查操作尽量优化到<0.2秒

2.数据库层面：设置慢查问执行超时工夫，超过了肯定工夫强制kill掉，比方查过20分钟，对于OLTP业务，如果一个 sql 20分钟没有执行完，那就不是 OLTP ，而是 OLAP 业务 ,应该放在大数据层解决或优化代码或sql

set global max_execution_time = 1200000;

3.备份层面:减少锁期待的超时工夫
--ftwrl-wait-timeout=5 --kill-long-queries-timeout=5

因为工夫仓促，可能有很多细节没有思考到，后续再优化。