GreatSQL 删除分区慢的跟踪
背景
某业务零碎,每天凌晨会删除分区表的一个分区(按天分区),耗时较久,从最开始的 30 秒,缓缓变为 1 分钟 +,影响到交易业务的失常进行。在测试环境进行了模仿,复现了删除分区慢的状况,本次基于 GreatSQL8.0.25-17 进行测试,官网 mysql 版本也存在雷同问题。
测试环境
$ mysql -h127.0.0.1 -P8025 -uroot -p
mysql: [Warning] Using a password on the command line interface can be insecure.
Welcome to the MySQL monitor. Commands end with ; or \g.
Your MySQL connection id is 10
Server version: 8.0.25-17 GreatSQL, Release 17, Revision 4733775f703
Copyright (c) 2000, 2021, Oracle and/or its affiliates.
Oracle is a registered trademark of Oracle Corporation and/or its
affiliates. Other names may be trademarks of their respective
owners.
Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.
greatsql> select version();
+-----------+
| version() |
+-----------+
| 8.0.25-17 |
+-----------+
1 row in set (0.00 sec)
greatsql> show variables like 'autocommit' ;
+---------------+-------+
| Variable_name | Value |
+---------------+-------+
| autocommit | ON |
+---------------+-------+
1 row in set (0.01 sec)
greatsql> show variables like 'innodb_buffer_pool_size';
+-------------------------+------------+
| Variable_name | Value |
+-------------------------+------------+
| innodb_buffer_pool_size | 4294967296 |
+-------------------------+------------+
1 row in set (0.01 sec)
greatsql> show variables like 'innodb_flush_log_at_trx_commit';
+--------------------------------+-------+
| Variable_name | Value |
+--------------------------------+-------+
| innodb_flush_log_at_trx_commit | 1 |
+--------------------------------+-------+
1 row in set (0.00 sec)
greatsql> show variables like 'sync_binlog';
+---------------+-------+
| Variable_name | Value |
+---------------+-------+
| sync_binlog | 1 |
+---------------+-------+
1 row in set (0.00 sec)
建表
CREATE TABLE `t_partition` (
`id` bigint NOT NULL AUTO_INCREMENT,
`ua` varchar(100) DEFAULT NULL,
`start_time` datetime NOT NULL,
PRIMARY KEY (`id`,`start_time`)
) PARTITION BY RANGE (to_days(`start_time`))
(PARTITION P20230129 VALUES LESS THAN (738915) ENGINE = InnoDB,
PARTITION P20230130 VALUES LESS THAN (738916) ENGINE = InnoDB,
PARTITION P20230131 VALUES LESS THAN (738917) ENGINE = InnoDB,
PARTITION P20230201 VALUES LESS THAN (738918) ENGINE = InnoDB,
PARTITION P20230202 VALUES LESS THAN (738919) ENGINE = InnoDB,
PARTITION P20230203 VALUES LESS THAN (738920) ENGINE = InnoDB,
PARTITION P20230204 VALUES LESS THAN (738921) ENGINE = InnoDB,
PARTITION P20230302 VALUES LESS THAN (738947) ENGINE = InnoDB,
PARTITION P20230303 VALUES LESS THAN (738948) ENGINE = InnoDB,
PARTITION P20230304 VALUES LESS THAN (738949) ENGINE = InnoDB,
PARTITION P20230305 VALUES LESS THAN (738950) ENGINE = InnoDB,
PARTITION P20230306 VALUES LESS THAN (738951) ENGINE = InnoDB,
PARTITION P20230307 VALUES LESS THAN (738952) ENGINE = InnoDB,
PARTITION P20230308 VALUES LESS THAN (738953) ENGINE = InnoDB,
PARTITION P20230309 VALUES LESS THAN (738954) ENGINE = InnoDB,
PARTITION P20230310 VALUES LESS THAN (738955) ENGINE = InnoDB,
PARTITION P20230311 VALUES LESS THAN (738956) ENGINE = InnoDB,
PARTITION P20230312 VALUES LESS THAN (738957) ENGINE = InnoDB,
PARTITION P20230313 VALUES LESS THAN (738958) ENGINE = InnoDB,
PARTITION P20230314 VALUES LESS THAN (738959) ENGINE = InnoDB,
PARTITION P20230315 VALUES LESS THAN (738960) ENGINE = InnoDB,
PARTITION P20230316 VALUES LESS THAN (738961) ENGINE = InnoDB,
PARTITION P20230317 VALUES LESS THAN (738962) ENGINE = InnoDB,
PARTITION P20230318 VALUES LESS THAN (738963) ENGINE = InnoDB,
PARTITION P20230319 VALUES LESS THAN (738964) ENGINE = InnoDB,
PARTITION P20230320 VALUES LESS THAN (738965) ENGINE = InnoDB,
PARTITION P20230321 VALUES LESS THAN (738966) ENGINE = InnoDB,
PARTITION P20230322 VALUES LESS THAN (738967) ENGINE = InnoDB,
PARTITION P20230323 VALUES LESS THAN (738968) ENGINE = InnoDB,
PARTITION P20230324 VALUES LESS THAN (738969) ENGINE = InnoDB,
PARTITION P20230325 VALUES LESS THAN (738970) ENGINE = InnoDB,
PARTITION P20230326 VALUES LESS THAN (738971) ENGINE = InnoDB,
PARTITION P20230327 VALUES LESS THAN (738972) ENGINE = InnoDB,
PARTITION P20230328 VALUES LESS THAN (738973) ENGINE = InnoDB,
PARTITION p20230329 VALUES LESS THAN (738974) ENGINE = InnoDB,
PARTITION p20230330 VALUES LESS THAN (738975) ENGINE = InnoDB,
PARTITION p20230331 VALUES LESS THAN (738976) ENGINE = InnoDB,
PARTITION p20230401 VALUES LESS THAN (738977) ENGINE = InnoDB,
PARTITION p20230402 VALUES LESS THAN (738978) ENGINE = InnoDB,
PARTITION p20230403 VALUES LESS THAN (738979) ENGINE = InnoDB);
插入数据
greatsql> insert into t_partition(ua,start_time) select substring(md5(rand()),1,20),'2023-03-10' ;
Query OK, 1 row affected (0.01 sec)
Records: 1 Duplicates: 0 Warnings: 0
greatsql> insert into t_partition(ua,start_time) select substring(md5(rand()),1,20),'2023-03-10' ;
Query OK, 1 row affected (0.00 sec)
Records: 1 Duplicates: 0 Warnings: 0
greatsql> insert into t_partition(ua,start_time) select substring(md5(rand()),1,20),'2023-03-10' ;
Query OK, 1 row affected (0.01 sec)
Records: 1 Duplicates: 0 Warnings: 0
greatsql> insert into t_partition(ua,start_time) select substring(md5(rand()),1,20),start_time from t_partition;
Query OK, 3 rows affected (0.01 sec)
Records: 3 Duplicates: 0 Warnings: 0
greatsql> insert into t_partition(ua,start_time) select substring(md5(rand()),1,20),start_time from t_partition;
Query OK, 6 rows affected (0.00 sec)
Records: 6 Duplicates: 0 Warnings: 0
...............
greatsql> insert into t_partition(ua,start_time) select substring(md5(rand()),1,20),start_time from t_partition;
Query OK, 3145728 rows affected (35.68 sec)
Records: 3145728 Duplicates: 0 Warnings: 0
greatsql> insert into t_partition(ua,start_time) select substring(md5(rand()),1,20),start_time from t_partition;
Query OK, 6291456 rows affected (1 min 11.51 sec)
Records: 6291456 Duplicates: 0 Warnings: 0
greatsql> insert into t_partition(ua,start_time) select substring(md5(rand()),1,20),start_time from t_partition;
Query OK, 12582912 rows affected (2 min 26.74 sec)
Records: 12582912 Duplicates: 0 Warnings: 0
greatsql> select count(*) from t_partition;
+----------+
| count(*) |
+----------+
| 25165824 |
+----------+
1 row in set (0.50 sec)
greatsql> select count(*) from t_partition partition(P20230310);
+----------+
| count(*) |
+----------+
| 25165824 |
+----------+
向分区插入数据
insert into t_partition(ua,start_time) select substring(md5(rand()),1,20),'2023-03-11' from t_partition partition(P20230310);
insert into t_partition(ua,start_time) select substring(md5(rand()),1,20),'2023-03-12' from t_partition partition(P20230310);
insert into t_partition(ua,start_time) select substring(md5(rand()),1,20),'2023-03-13' from t_partition partition(P20230310);
insert into t_partition(ua,start_time) select substring(md5(rand()),1,20),'2023-03-14' from t_partition partition(P20230310);
insert into t_partition(ua,start_time) select substring(md5(rand()),1,20),'2023-03-15' from t_partition partition(P20230310);
insert into t_partition(ua,start_time) select substring(md5(rand()),1,20),'2023-03-16' from t_partition partition(P20230310);
insert into t_partition(ua,start_time) select substring(md5(rand()),1,20),'2023-03-17' from t_partition partition(P20230310);
insert into t_partition(ua,start_time) select substring(md5(rand()),1,20),'2023-03-18' from t_partition partition(P20230310);
insert into t_partition(ua,start_time) select substring(md5(rand()),1,20),'2023-03-19' from t_partition partition(P20230310);
insert into t_partition(ua,start_time) select substring(md5(rand()),1,20),'2023-03-20' from t_partition partition(P20230310);
insert into t_partition(ua,start_time) select substring(md5(rand()),1,20),'2023-03-21' from t_partition partition(P20230310);。。。。。。。。。。。greatsql> insert into t_partition(ua,start_time) select substring(md5(rand()),1,20),'2023-03-19' from t_partition partition(P20230310);
Query OK, 25165824 rows affected (5 min 17.92 sec)
Records: 25165824 Duplicates: 0 Warnings: 0
greatsql> insert into t_partition(ua,start_time) select substring(md5(rand()),1,20),'2023-03-20' from t_partition partition(P20230310);
Query OK, 25165824 rows affected (5 min 19.56 sec)
Records: 25165824 Duplicates: 0 Warnings: 0
greatsql> insert into t_partition(ua,start_time) select substring(md5(rand()),1,20),'2023-03-21' from t_partition partition(P20230310);
Query OK, 25165824 rows affected (5 min 27.27 sec)
Records: 25165824 Duplicates: 0 Warnings: 0
更新数据
greatsql> Update t_partition set ua=concat(ua,'abc') where start_time='2023-03-19';
Query OK, 25165824 rows affected (12 min 55.53 sec)
Rows matched: 25165824 Changed: 25165824 Warnings: 0
删除分区 s
greatsql> alter table t_partition drop partition P20230311;
Query OK, 0 rows affected (13.68 sec)
Records: 0 Duplicates: 0 Warnings: 0
greatsql> alter table t_partition drop partition P20230312;
Query OK, 0 rows affected (0.07 sec)
Records: 0 Duplicates: 0 Warnings: 0
两个分区数据量是一样,但删除第一个分区耗时较长。
$ perf record -ag -p 11222 -o /mysqldb/perf_drop_part_mysql2.data
Warning:
PID/TID switch overriding SYSTEM
^C[perf record: Woken up 41 times to write data]
[perf record: Captured and wrote 10.610 MB /mysqldb/perf_drop_part_mysql2.data (54351 samples) ]
Children Self Command Shared Object Symbol
+ 99.17% 0.00% mysqld libpthread-2.17.so [.] start_thread
+ 99.17% 0.00% mysqld mysqld [.] pfs_spawn_thread
+ 99.17% 0.00% mysqld mysqld [.] handle_connection
+ 99.17% 0.00% mysqld mysqld [.] do_command
+ 99.17% 0.00% mysqld mysqld [.] dispatch_command
+ 99.17% 0.00% mysqld mysqld [.] dispatch_sql_command
+ 99.16% 0.00% mysqld mysqld [.] mysql_execute_command
+ 99.16% 0.00% mysqld mysqld [.] Sql_cmd_alter_table::execute
+ 99.16% 0.00% mysqld mysqld [.] mysql_alter_table
+ 99.09% 0.00% mysqld mysqld [.] mysql_inplace_alter_table
+ 98.56% 0.00% mysqld mysqld [.] ha_innopart::commit_inplace_alter_partition
+ 98.54% 0.00% mysqld mysqld [.] alter_parts::prepare_or_commit_for_new
+ 98.54% 0.00% mysqld mysqld [.] alter_part_normal::try_commit
+ 98.53% 0.00% mysqld mysqld [.] btr_drop_ahi_for_table
+ 98.52% 1.30% mysqld mysqld [.] btr_drop_next_batch
+ 97.20% 0.03% mysqld mysqld [.] btr_search_drop_page_hash_when_freed
+ 96.34% 2.03% mysqld mysqld [.] btr_search_drop_page_hash_index
+ 86.52% 52.25% mysqld mysqld [.] ha_remove_all_nodes_to_page
+ 34.21% 34.15% mysqld mysqld [.] ha_delete_hash_node
+ 4.27% 2.68% mysqld mysqld [.] rec_get_offsets_func
+ 2.11% 2.10% mysqld mysqld [.] ut_fold_binary
+ 1.58% 1.58% mysqld mysqld [.] rec_init_offsets
+ 1.30% 1.30% mysqld mysqld [.] rec_fold
+ 0.57% 0.03% mysqld mysqld [.] buf_page_get_gen
+ 0.57% 0.00% mysqld mysqld [.] execute_native_thread_routine
+ 0.56% 0.01% mysqld [kernel.kallsyms] [k] system_call_fastpath
从零碎调用上看有大量的自适应 hash 相干的调用
重启敞开自适应 hash
greatsql> show variables like '%hash%';
+----------------------------------+-------+
| Variable_name | Value |
+----------------------------------+-------+
| innodb_adaptive_hash_index | OFF |
| innodb_adaptive_hash_index_parts | 8 |
+----------------------------------+-------+
批改配置文件,敞开自适应 hash,依照下面的流程从新执行
greatsql> alter table t_partition drop partition P20230311;
Query OK, 0 rows affected (0.08 sec)
Records: 0 Duplicates: 0 Warnings: 0
greatsql> alter table t_partition drop partition P20230312;
Query OK, 0 rows affected (0.08 sec)
Records: 0 Duplicates: 0 Warnings: 0
敞开自适应 hash 后,雷同的操作过程,删除第一个分区的工夫显著变短,删除每个分区的工夫基本上统一。
备注:innodb_adaptive_hash_index 是全局变量,能够动静批改,不重启数据库。
测试后果汇总
自适应 hash 比照 | 第一次删分区 | 第二次删分区 |
---|---|---|
innodb_buffer_pool_instances=8& innodb_adaptive_hash_index=on | 13.68 | 0.07 |
innodb_buffer_pool_instances=8& innodb_adaptive_hash_index=off | 0.08 | 0.08 |
源码剖析
// btr_drop_ahi_for_table
void btr_drop_ahi_for_table(dict_table_t *table) {const ulint len = UT_LIST_GET_LEN(table->indexes);
if (len == 0) {return;}
const dict_index_t *indexes[MAX_INDEXES];
const page_size_t page_size(dict_table_page_size(table));
for (;;) {
ulint ref_count = 0;
const dict_index_t **end = indexes;
for (dict_index_t *index = table->first_index(); index != nullptr;
index = index->next()) {if (ulint n_refs = index->search_info->ref_count) {ut_ad(!index->disable_ahi);
ut_ad(index->is_committed());
ref_count += n_refs;
ut_ad(indexes + len > end);
*end++ = index;
}
}
ut_ad((indexes == end) == (ref_count == 0));
if (ref_count == 0) {return;}
btr_drop_next_batch(page_size, indexes, end); // breakpoint
std::this_thread::yield();}
}
// btr_drop_next_batch
static void btr_drop_next_batch(const page_size_t &page_size,
const dict_index_t **first,
const dict_index_t **last) {
static constexpr unsigned batch_size = 1024;
std::vector<page_id_t> to_drop;
to_drop.reserve(batch_size);
for (ulint i = 0; i < srv_buf_pool_instances; ++i) {to_drop.clear();
buf_pool_t *buf_pool = buf_pool_from_array(i);
mutex_enter(&buf_pool->LRU_list_mutex);
const buf_page_t *prev;
for (const buf_page_t *bpage = UT_LIST_GET_LAST(buf_pool->LRU);
bpage != nullptr; bpage = prev) {prev = UT_LIST_GET_PREV(LRU, bpage);
ut_a(buf_page_in_file(bpage));
if (buf_page_get_state(bpage) != BUF_BLOCK_FILE_PAGE ||
bpage->buf_fix_count > 0) {continue;}
const dict_index_t *block_index =
reinterpret_cast<const buf_block_t *>(bpage)->ahi.index;
/* index == nullptr means the page is no longer in AHI, so no need to
attempt freeing it */
if (block_index == nullptr) {continue;}
/* pages IO fixed for read have index == nullptr */
ut_ad(!bpage->was_io_fix_read());
if (std::find(first, last, block_index) != last) {to_drop.emplace_back(bpage->id);
if (to_drop.size() == batch_size) { // batch_size = 1024
break;
}
}
}
mutex_exit(&buf_pool->LRU_list_mutex);
for (const page_id_t &page_id : to_drop) {btr_search_drop_page_hash_when_freed(page_id, page_size); // breakpoint
}
}
}
// btr_search_drop_page_hash_when_freed
void btr_search_drop_page_hash_when_freed(const page_id_t &page_id,
const page_size_t &page_size) {
buf_block_t *block;
mtr_t mtr;
ut_d(export_vars.innodb_ahi_drop_lookups++);
mtr_start(&mtr);
/* If the caller has a latch on the page, then the caller must
have a x-latch on the page and it must have already dropped
the hash index for the page. Because of the x-latch that we
are possibly holding, we cannot s-latch the page, but must
(recursively) x-latch it, even though we are only reading. */
block = buf_page_get_gen(page_id, page_size, RW_X_LATCH, nullptr,
Page_fetch::PEEK_IF_IN_POOL, UT_LOCATION_HERE, &mtr);
if (block) {
/* If AHI is still valid, page can't be in free state.
AHI is dropped when page is freed. */
ut_ad(!block->page.file_page_was_freed);
buf_block_dbg_add_level(block, SYNC_TREE_NODE_FROM_HASH);
dict_index_t *index = block->ahi.index;
if (index != nullptr) {
/* In all our callers, the table handle should
be open, or we should be in the process of
dropping the table (preventing eviction). */
ut_ad(index->table->n_ref_count > 0 || dict_sys_mutex_own());
btr_search_drop_page_hash_index(block);
}
}
mtr_commit(&mtr);
}
函数逻辑阐明:
drop 分区和 add 分区都会清空所有分区的 AHI 信息,最耗时的如下
循环每个分区调用函数
->btr_drop_ahi_for_table
循环表(或者分区)中的每个索引,如果索引都没有用到 AHI,则退出
循环 innodb buffer 中的每个实例,依据 LRU 链表循环每个 page
如果 page 建设了 AHI 信息,且是要删除表(或者分区)的相干索引
则放入 drop vector 容器中
如果 page 没有建设 AHI 信息
则跳过
如果 drop verctor 容器中填满 1024 个 page
则清理一次,循环每个 page,调用函数
->btr_search_drop_page_hash_index
计算 page 所在 AHI 构造的 slot 信息,以便找到对应的 hash_table_t 构造
循环 page 中所有的行
循环行中拜访到的索引字段(拜访模式),计算出 fold 信息填入到 fold[]数组中
本循环中会通过函数 rec_get_offsets 进行字段偏移量的获取,为耗用 CPU 的函数
循环 fold[]数组,一个 fold 代表一行数据, 调用函数
->ha_remove_all_nodes_to_page, 为耗用 CPU 的函数
->ha_chain_get_fist
依据 fold 信息找到 hash 构造的 cell
循环本 cell 中的链表信息
如果行的地址在本要删除的 page 上,调用函数
->ha_delete_hash_node,为耗费 CPU 的函数
进行链表和 hash 构造的保护
每次解决完 1024 个 page 后,yeild 线程被动放弃 CPU,防止长期占用 CPU,醒来后持续解决
drop 分区和 add 分区都会循环每个分区调用函数 btr_drop_ahi_for_table、btr_search_drop_page_hash_index 清空所有分区及索引的的 AHI 信息,删除第 1 个分区的时 ahi 信息被清空,删第 2 个分区的时候 buffer 中曾经没有 ahi 信息了,所有删除第 2 个分区就很快了。
防止形式
针对以上起因,线上执行 drop 必须恪守以下准则:
1、敞开 AHI 性能,不应用 AHI 带来的查问减速性能,须要先在测试环境进行业务测试,确保业务性能不受影响。
2、删除表的第一个分区时,外部会清理该表在每个 buffer pool 实例中对应的数据块页面,耗时较久,接着删其余分区耗时很小,倡议将每天一次的删除分区的操作改为每周或者每月批量执行删除分区的操作,并且须要在业务低峰期操作。
Enjoy GreatSQL :)
## 对于 GreatSQL
GreatSQL 是由万里数据库保护的 MySQL 分支,专一于晋升 MGR 可靠性及性能,反对 InnoDB 并行查问个性,是实用于金融级利用的 MySQL 分支版本。
相干链接:GreatSQL 社区 Gitee GitHub Bilibili
GreatSQL 社区:
社区博客有奖征稿详情:https://greatsql.cn/thread-100-1-1.html
技术交换群:
微信:扫码增加
GreatSQL 社区助手
微信好友,发送验证信息加群
。