关于备份:故障分析-血的教训由慢查询引发的备份等待导致数据库连接打满

25次阅读

共计 13150 个字符,预计需要花费 33 分钟才能阅读完成。

作者:马文斌

MySQL 爱好者, 任职于蓝月亮 (中国) 有限公司。

本文起源:原创投稿

* 爱可生开源社区出品,原创内容未经受权不得随便应用,转载请分割小编并注明起源。


背景

早上一大早被拉去开早会,感觉事件不妙,得悉是某中台 (发券) 数据库 不能失常提供拜访。呈现大量的上面状况:

| 55279262 | xxx_app | 192.168.50.143:44623 | xxx_mam | Query | 206 | Waiting for table flush | SELECT count(*) FROM mam_user_coupon_package AS a LEFT JOIN mam_coupon_apply_scope as d on a.coupon_info_id = d.coupon_info_id WHERE 1=1 AND a.usable_flag = 1 AND a.user_id = 102101100050614656 and a.coupon_status in (1) and d.business_code = 30006 AND a.end_time >= '2022-04-25 08:56:49.683'
                                                                                                                  |
| 55278572 | xxx_app | 192.168.50.144:40013 | xxx_mam | Query | 315 | Waiting for table flush | SELECT coupon_user_code,last_update_time,operator_name,operator_id,create_time,creator_name,creator_id,del_flag,user_id,coupon_status,coupon_info_id,start_time,end_time,use_time,activate_time,activate_end_time,coupon_type,coupon_info_type,act_id,usable_flag FROM mam_user_coupon_package WHERE (( end_time <= '2022-04-25 08:55:00.638' and usable_flag = 1 and del_flag = 0 and coupon_status = 1) ) limit 0,1000

| 55165820 | root | localhost | NULL | Query | 28956 | Waiting for table flush | FLUSH NO_WRITE_TO_BINLOG TABLES |

字面意思是期待表刷新,然而这只是个查问,为啥要刷新表呢,而且是几百上千个全局连贯进来,一时半会找不到解决方案,只有先救火后剖析,那就先把查问干掉再说

select concat('kill', id , ';') from PROCESSLIST where USER='xxx_app' and STATE='Waiting for table flush' ;

前面发现不行啊,早上一直有连贯申请连贯进来,这是指标不治标,没方法只能用重启大法,重启完之后数据库终于又复原了来日的静宁了。

预先剖析:

有 SQL 与 备份操作抵触导致卡死呈现。

220425 01:01:48 [01] Streaming ./xx_ums/ums_channal_authorized.ibd
220425 01:01:48 [01] ...done
220425 01:01:48 >> log scanned up to (425114657787)
220425 01:01:48 Executing FLUSH NO_WRITE_TO_BINLOG TABLES...
220425 01:01:49 >> log scanned up to (425114657787)
……
220425 09:05:52 >> log scanned up to (425114692507)
Error: failed to execute query 'FLUSH NO_WRITE_TO_BINLOG TABLES': 2013 (HY000) Lost connection to MySQL server during query

这里是局部备份时产生的日志文件,要害信息就在这里,flush table 操作的不记录到 binlog 里,防止从库也执行同样的操作。

于是带着疑难,1、要么有长事务没有提交,2、要么有大的慢查问阻塞

备份原理:

备份流程图大抵如下:

1、依据链接信息做数据库的连通性和版本检测
2、读取数据库相干的配置信息(数据和日志文件地位等信息)3、开启 redo 日志拷贝线程,从最新的 checkpoint 开始程序拷贝 redolog 到 xtrabackup_logfile 中
4、开始备份 .ibd ibdata1 undo 等 innodb 文件
5、执行 FLUSH NO_WRITE_TO_BINLOG TABLES 和  FLUSH TABLES WITH READ LOCK
    1)敞开所有关上的表,强制敞开所有正在应用的表, 不写入 binlog
    2)敞开所有关上的表,强制敞开所有正在应用的表,并给所有表减少全局 read lock
6、拷贝非 innodb 表相干文件信息
7、读取 binlog pos 信息并写入 xtrabackup_binlog_info 文件
8、执行 FLUSH NO_WRITE_TO_BINLOG ENGINE LOGS
    1)刷新 redo log buffer 中的日志到磁盘文件,确保 redo log copy 线程拷贝这最初的 redolog 日志数据(为什么说是最初的
    redolog 日志数据,因为在 FLUSH TABLES WITH READ LOCK 加锁之后,应用 UNLOCK TABLES 开释全局读锁之前,不会再有新
    的申请进来)2)为什么确保所有 redo 日志拷贝残缺(因为没有拷贝 binlog,如果最初一组提交的事物 redo 的 commit 标记没有落盘复原时
    会失落最初一组事物)
9、进行 redolog 日志拷贝线程
10、执行 UNLOCK TABLES 开释表锁
11、收尾生成 backup-my.cnf 和 xtrabackup_info 等文件,备份实现。

能够看到咱们备份的时候是卡在这一步:

FLUSH NO_WRITE_TO_BINLOG TABLES

1)敞开所有关上的表,强制敞开所有正在应用的表, 不写入 binlog

ok,十分清晰,咱们从 2 个方面动手,写 | 读

场景 1: 模仿未提交的写事务是否阻塞备份

序号 - 工夫线 trx1 backup
1 begin;
2 update tb1 set id =10 where id =1;
3 /usr/bin/innobackupex –defaults-file=/etc/my.cnf –user=bigdata_admin –password=123456 –socket=/mysqlData/data/mysql.sock –port=3306 /data/backup/
4 220425 15:13:09 completed OK!

备份输入的日志:

220425 15:13:08 [01] Copying ./db1/db.opt to /data/backup//2022-04-25_15-12-25/db1/db.opt
220425 15:13:08 [01]        ...done
220425 15:13:08 [01] Copying ./db1/tb1.frm to /data/backup//2022-04-25_15-12-25/db1/tb1.frm
220425 15:13:08 [01]        ...done
220425 15:13:08 Finished backing up non-InnoDB tables and files
220425 15:13:08 [00] Writing xtrabackup_binlog_info
220425 15:13:08 [00]        ...done
220425 15:13:08 Executing FLUSH NO_WRITE_TO_BINLOG ENGINE LOGS...
xtrabackup: The latest check point (for incremental): '2447317'
xtrabackup: Stopping log copying thread.
.220425 15:13:08 >> log scanned up to (2447326)

220425 15:13:08 Executing UNLOCK TABLES
220425 15:13:08 All tables unlocked
220425 15:13:08 [00] Copying ib_buffer_pool to /data/backup//2022-04-25_15-12-25/ib_buffer_pool
220425 15:13:08 [00]        ...done
220425 15:13:08 Backup created in directory '/data/backup//2022-04-25_15-12-25'
MySQL binlog position: filename 'mysql-bin.000003', position '1032', GTID of the last change '74fd0ba9-c45b-11ec-97a4-525400abe2c0:1-8'
220425 15:13:08 [00] Writing backup-my.cnf
220425 15:13:08 [00]        ...done
220425 15:13:08 [00] Writing xtrabackup_info
220425 15:13:08 [00]        ...done
xtrabackup: Transaction log of lsn (2447317) to (2447326) was copied.
220425 15:13:09 completed OK!

论断:这个没提交的写事务是不会阻塞备份

场景 2: 模仿事务中的慢查问是否阻塞备份

序号 - 工夫线 trx1 backup
1 begin;
2 select id,sleep(1000) from tb1;
3 /usr/bin/innobackupex –defaults-file=/etc/my.cnf –user=bigdata_admin –password=123456 –socket=/mysqlData/data/mysql.sock –port=3306 /data/backup/
4 220425 15:20:58 [01] Copying ./db1/tb1.ibd to /data/backup//2022-04-25_15-20-21/db1/tb1.ibd<br/>220425 15:20:58 [01] …done<br/>220425 15:20:58 >> log scanned up to (2447326)<br/>220425 15:20:59 Executing FLUSH NO_WRITE_TO_BINLOG TABLES…<br/>220425 15:20:59 >> log scanned up to (2447326)<br/>220425 15:21:00 >> log scanned up to (2447326)<br/>220425 15:21:01 >> log scanned up to (2447326)<br/>220425 15:21:02 >> log scanned up to (2447326)<br/>220425 15:21:03 >> log scanned up to (2447326)<br/>220425 15:21:04 >> log scanned up to (2447326)<br/>220425 15:21:05 >> log scanned up to (2447326)<br/>220425 15:21:06 >> log scanned up to (2447326)<br/>220425 15:21:07 >> log scanned up to (2447326)<br/>220425 15:21:08 >> log scanned up to (2447326)<br/>220425 15:21:09 >> log scanned up to (2447326)<br/>220425 15:21:10 >> log scanned up to (2447326)<br/>220425 15:21:11 >> log scanned up to (2447326)<br/>220425 15:21:12 >> log scanned up to (2447326)<br/>220425 15:21:13 >> log scanned up to (2447326)<br/>220425 15:21:14 >> log scanned up to (2447326)<br/>220425 15:21:15 >> log scanned up to (2447326)
5 localhost 15:12:19 [db1]> select id,sleep(1000) from tb1;<br/>ERROR 3024 (HY000): Query execution was interrupted, maximum statement execution time exceeded
6 220425 15:22:18 [00] Writing xtrabackup_binlog_info<br/>220425 15:22:18 [00] …done<br/>220425 15:22:18 Executing FLUSH NO_WRITE_TO_BINLOG ENGINE LOGS…<br/>xtrabackup: The latest check point (for incremental): ‘2447317’<br/>xtrabackup: Stopping log copying thread.<br/>.220425 15:22:18 >> log scanned up to (2447326)<br/><br/>220425 15:22:18 Executing UNLOCK TABLES<br/>220425 15:22:18 All tables unlocked<br/>220425 15:22:18 [00] Copying ib_buffer_pool to /data/backup//2022-04-25_15-20-21/ib_buffer_pool<br/>220425 15:22:18 [00] …done<br/>220425 15:22:18 Backup created in directory ‘/data/backup//2022-04-25_15-20-21′<br/>MySQL binlog position: filename ‘mysql-bin.000003’, position ‘1032’, GTID of the last change ’74fd0ba9-c45b-11ec-97a4-525400abe2c0:1-8′<br/>220425 15:22:18 [00] Writing backup-my.cnf<br/>220425 15:22:18 [00] …done<br/>220425 15:22:18 [00] Writing xtrabackup_info<br/>220425 15:22:18 [00] …done<br/>xtrabackup: Transaction log of lsn (2447317) to (2447326) was copied.<br/>220425 15:22:18 completed OK!

咱们能够看到,直到 sql 执行超时,也就是意味着表敞开了,备份才胜利。

论断:事务中的慢查问会阻塞备份

场景 3:模仿不带事务慢查问是否阻塞备份

序号 - 工夫线 trx-1 backup
1 select id,sleep(1000) from tb1;
2 /usr/bin/innobackupex –defaults-file=/etc/my.cnf –user=bigdata_admin –password=123456 –socket=/mysqlData/data/mysql.sock –port=3306 /data/backup/
3 220425 15:26:03 [01] …done<br/>220425 15:26:03 [01] Copying ./db1/tb1.ibd to /data/backup//2022-04-25_15-25-27/db1/tb1.ibd<br/>220425 15:26:03 [01] …done<br/>220425 15:26:03 Executing FLUSH NO_WRITE_TO_BINLOG TABLES…<br/>220425 15:26:04 >> log scanned up to (2447326)<br/>220425 15:26:05 >> log scanned up to (2447326)<br/>220425 15:26:06 >> log scanned up to (2447326)<br/>220425 15:26:07 >> log scanned up to (2447326)<br/>220425 15:26:08 >> log scanned up to (2447326)<br/>220425 15:26:09 >> log scanned up to (2447326)
4 localhost 15:22:17 [db1]> select id,sleep(1000) from tb1;<br/>ERROR 3024 (HY000): Query execution was interrupted, maximum statement execution time exceeded
5 220425 15:27:23 [00] …done<br/>220425 15:27:23 Executing FLUSH NO_WRITE_TO_BINLOG ENGINE LOGS…<br/>xtrabackup: The latest check point (for incremental): ‘2447317’<br/>xtrabackup: Stopping log copying thread.<br/>.220425 15:27:23 >> log scanned up to (2447326)<br/><br/>220425 15:27:23 Executing UNLOCK TABLES<br/>220425 15:27:23 All tables unlocked<br/>220425 15:27:23 [00] Copying ib_buffer_pool to /data/backup//2022-04-25_15-25-27/ib_buffer_pool<br/>220425 15:27:23 [00] …done<br/>220425 15:27:23 Backup created in directory ‘/data/backup//2022-04-25_15-25-27′<br/>MySQL binlog position: filename ‘mysql-bin.000003’, position ‘1032’, GTID of the last change ’74fd0ba9-c45b-11ec-97a4-525400abe2c0:1-8′<br/>220425 15:27:23 [00] Writing backup-my.cnf<br/>220425 15:27:23 [00] …done<br/>220425 15:27:23 [00] Writing xtrabackup_info<br/>220425 15:27:23 [00] …done<br/>xtrabackup: Transaction log of lsn (2447317) to (2447326) was copied.<br/>220425 15:27:23 completed OK!
6

咱们能够看到,直到 sql 执行超时,也就是意味着表敞开了,备份拿到了备份锁,备份才胜利。对于备份锁的材料请查看这里:https://www.percona.com/doc/p…

论断:不带事务的慢查问也会阻塞备份

FLUSH TABLES

FLUSH NO_WRITE_TO_BINLOG TABLES = FLUSH TABLES,只不过这种 flush tables 的动作是不会写入到 binlog 文件中,而 flush tables 是须要获取锁。官网文件这样形容 flush tables

The FLUSH statement has several variant forms that clear or reload various internal caches, flush tables, or acquire locks.
By default, the server writes FLUSH statements to the binary log so that they replicate to replicas. To suppress logging, specify the optional NO_WRITE_TO_BINLOG keyword or its alias LOCAL.

FLUSH TABLES flushes tables, and, depending on the variant used, acquires locks. Any TABLES variant used in a FLUSH statement must be the only option used. FLUSH TABLE is a synonym for FLUSH TABLES.

Closes all open tables, forces all tables in use to be closed, and flushes the query cache and prepared statement cache. FLUSH TABLES also removes all query results from the query cache, like the RESET QUERY CACHE statement. For information about query caching and prepared statement caching

大略意思就是:敞开所有关上的表,强制敞开所有正在应用的表,并刷新查问缓存和筹备好的语句缓存。FLUSH TABLES还会从查问缓存中删除所有查问后果,例如 RESET QUERY CACHE语句。无关查问缓存和筹备好的语句缓存的信息

反推生产环境

查看慢查问

Oh my god,在 32.197 服务器的慢查问中搜寻到一个执行了 10 几个小时的慢查问

# Time: 2022-04-25T09:12:53.416049+08:00
# User@Host: BlueMoon[BlueMoon] @  [192.168.50.144]  Id: 55128898
# Query_time: 38910.325000  Lock_time: 0.000328 Rows_sent: 0  Rows_examined: 14637241
SET timestamp=1650849173;
SELECT
        d.coupon_master_code
FROM
        mam_coupon_activity_interact a
        LEFT JOIN mam_coupon_send_log b ON a.id = b.interact_id
        LEFT JOIN mam_user_coupon_package c ON c.coupon_user_code = b.coupon_user_code
        left join
        mam_coupon_info  d on c.coupon_info_id = d.id
WHERE
        a.interact_type = 3
GROUP BY
        d.coupon_master_code;

38910 秒 =10.8 小时,快 11 个钟了,Time: 2022-04-25T09:12:53,往前推 38910 秒,也就是 2022-04-24 22:24:23 开始执行这个 sql。

为什么会这么慢,咱们看下执行计

表构造如下:

mysql> show create table mam_coupon_send_log \G
*************************** 1. row ***************************
       Table: mam_coupon_send_log
Create Table: CREATE TABLE `mam_coupon_send_log` (`id` int(10) NOT NULL AUTO_INCREMENT COMMENT '主键',
  `last_update_time` datetime DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP COMMENT '最初更新工夫',
  `operator_name` varchar(32) COLLATE utf8mb4_unicode_ci DEFAULT NULL COMMENT '操作人名称',
  `operator_id` int(11) DEFAULT NULL COMMENT '操作人编号',
  `create_time` datetime DEFAULT CURRENT_TIMESTAMP COMMENT '创立工夫',
  `creator_name` varchar(32) COLLATE utf8mb4_unicode_ci DEFAULT NULL COMMENT '创建人名称',
  `creator_id` int(11) DEFAULT NULL COMMENT '创建人编号',
  `del_flag` tinyint(1) DEFAULT '0' COMMENT '是否删除(0:否,1:是)',
  `shop_order_id` bigint(20) DEFAULT NULL COMMENT '订单编号',
  `activity_code` bigint(20) DEFAULT NULL COMMENT '流动编号',
  `coupon_user_code` bigint(20) DEFAULT NULL COMMENT '用户券编号',
  `user_id` bigint(20) DEFAULT NULL COMMENT '领券人 id',
  `mensend_id` int(11) DEFAULT NULL COMMENT '人工发券 id',
  `interact_id` bigint(20) DEFAULT NULL COMMENT '互动赠券 id',
  `send_type` tinyint(1) DEFAULT NULL COMMENT '发券类型(1 人工发券 2 积分兑券 3 买单赠券 4 数据迁徙 5 券转增 6 券转换 7 营销互动)',
  `act_id` int(11) DEFAULT NULL COMMENT '流动 id',
  `integral_record_id` bigint(20) DEFAULT NULL COMMENT '积分兑换记录表 id',
  `interact_detail_id` bigint(20) DEFAULT NULL COMMENT '互动赠券明细 id',
  PRIMARY KEY (`id`) USING BTREE,
  KEY `idx_coupon_use_code` (`coupon_user_code`) USING BTREE COMMENT '卡券包编码索引',
  KEY `idx_shop_order_id_send_type` (`shop_order_id`,`send_type`) USING BTREE
) ENGINE=InnoDB AUTO_INCREMENT=551413 DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci ROW_FORMAT=DYNAMIC COMMENT='用户券发送日志表'

倡议对 interact_id 字段加索引

alter table mam_coupon_send_log add index idx_interact_id(interact_id);

而咱们那点备份是从早晨 1 点开始备份,也就是说这个备份始终在期待这个查问完结, 合乎场景 3:

查看数据库的 sql 超时是否设置

mysql> show variables like '%exec%';
+----------------------------------+--------+
| Variable_name                    | Value  |
+----------------------------------+--------+
| gtid_executed_compression_period | 1000   |
| max_execution_time               | 0      |
| rbr_exec_mode                    | STRICT |
| slave_exec_mode                  | STRICT |
+----------------------------------+--------+

。。。。。竟然没有设置 sql 查问超时工夫

倡议设置 sql 超时工夫

set global max_execution_time = 120000;
120 秒超时

反推备份优化

设置超时工夫

https://www.percona.com/doc/p…

--ftwrl-wait-timeout=SECONDS
This option specifies time in seconds that innobackupex should wait for queries that would block FLUSH TABLES WITH READ LOCK before running it. If there are still such queries when the timeout expires, innobackupex terminates with an error. Default is 0, in which case innobackupex does not wait for queries to complete and starts FLUSH TABLES WITH READ LOCK immediately. Where supported (Percona Server 5.6+) xtrabackup will automatically use Backup Locks as a lightweight alternative to FLUSH TABLES WITH READ LOCK to copy non-InnoDB data to avoid blocking DML queries that modify InnoDB tables.

--kill-long-queries-timeout=SECONDS
This option specifies the number of seconds innobackupex waits between starting FLUSH TABLES WITH READ LOCK and killing those queries that block it. Default is 0 seconds, which means innobackupex will not attempt to kill any queries. In order to use this option xtrabackup user should have PROCESS and SUPER privileges. Where supported (Percona Server 5.6+) xtrabackup will automatically use Backup Locks as a lightweight alternative to FLUSH TABLES WITH READ LOCK to copy non-InnoDB data to avoid blocking DML queries that modify InnoDB tables.

/usr/bin/innobackupex --defaults-file=/etc/my.cnf --user=bigdata_admin --password=123456 --socket=/mysqlData/data/mysql.sock --port=3306 --ftwrl-wait-timeout=5 --kill-long-queries-timeout=5 /data/backup/

总结

1. 业务层面:须要定期优化慢查问,业务库的增删改查操作尽量优化到 <0.2 秒

2. 数据库层面:设置慢查问执行超时工夫,超过了肯定工夫强制 kill 掉,比方查过 20 分钟,对于 OLTP 业务,如果一个 sql 20 分钟没有执行完,那就不是 OLTP,而是 OLAP 业务 , 应该放在大数据层解决或优化代码或 sql

set global max_execution_time = 1200000;

3. 备份层面: 减少锁期待的超时工夫
–ftwrl-wait-timeout=5 –kill-long-queries-timeout=5

因为工夫仓促,可能有很多细节没有思考到,后续再优化。

正文完
 0