关于数据库:GreatSQL删除分区慢的跟踪

GreatSQL删除分区慢的跟踪

背景

某业务零碎，每天凌晨会删除分区表的一个分区(按天分区)，耗时较久，从最开始的30秒，缓缓变为1分钟+，影响到交易业务的失常进行。在测试环境进行了模仿，复现了删除分区慢的状况，本次基于GreatSQL8.0.25-17进行测试，官网mysql版本也存在雷同问题。

测试环境

$ mysql -h127.0.0.1 -P8025 -uroot -pmysql: [Warning] Using a password on the command line interface can be insecure.Welcome to the MySQL monitor.  Commands end with ; or \g.Your MySQL connection id is 10Server version: 8.0.25-17 GreatSQL, Release 17, Revision 4733775f703Copyright (c) 2000, 2021, Oracle and/or its affiliates.Oracle is a registered trademark of Oracle Corporation and/or itsaffiliates. Other names may be trademarks of their respectiveowners.Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.greatsql> select version();+-----------+| version() |+-----------+| 8.0.25-17 |+-----------+1 row in set (0.00 sec)greatsql> show variables like 'autocommit' ;+---------------+-------+| Variable_name | Value |+---------------+-------+| autocommit    | ON    |+---------------+-------+1 row in set (0.01 sec)greatsql> show variables like 'innodb_buffer_pool_size';+-------------------------+------------+| Variable_name           | Value      |+-------------------------+------------+| innodb_buffer_pool_size | 4294967296 |+-------------------------+------------+1 row in set (0.01 sec)greatsql> show variables like 'innodb_flush_log_at_trx_commit';+--------------------------------+-------+| Variable_name                  | Value |+--------------------------------+-------+| innodb_flush_log_at_trx_commit | 1     |+--------------------------------+-------+1 row in set (0.00 sec)greatsql> show variables like 'sync_binlog';+---------------+-------+| Variable_name | Value |+---------------+-------+| sync_binlog   | 1     |+---------------+-------+1 row in set (0.00 sec)

建表

CREATE TABLE `t_partition` (  `id` bigint NOT NULL AUTO_INCREMENT,  `ua` varchar(100) DEFAULT NULL,  `start_time` datetime NOT NULL,  PRIMARY KEY (`id`,`start_time`))  PARTITION BY RANGE (to_days(`start_time`))(PARTITION P20230129 VALUES LESS THAN (738915) ENGINE = InnoDB, PARTITION P20230130 VALUES LESS THAN (738916) ENGINE = InnoDB, PARTITION P20230131 VALUES LESS THAN (738917) ENGINE = InnoDB, PARTITION P20230201 VALUES LESS THAN (738918) ENGINE = InnoDB, PARTITION P20230202 VALUES LESS THAN (738919) ENGINE = InnoDB, PARTITION P20230203 VALUES LESS THAN (738920) ENGINE = InnoDB, PARTITION P20230204 VALUES LESS THAN (738921) ENGINE = InnoDB, PARTITION P20230302 VALUES LESS THAN (738947) ENGINE = InnoDB, PARTITION P20230303 VALUES LESS THAN (738948) ENGINE = InnoDB, PARTITION P20230304 VALUES LESS THAN (738949) ENGINE = InnoDB, PARTITION P20230305 VALUES LESS THAN (738950) ENGINE = InnoDB, PARTITION P20230306 VALUES LESS THAN (738951) ENGINE = InnoDB, PARTITION P20230307 VALUES LESS THAN (738952) ENGINE = InnoDB,  PARTITION P20230308 VALUES LESS THAN (738953) ENGINE = InnoDB, PARTITION P20230309 VALUES LESS THAN (738954) ENGINE = InnoDB, PARTITION P20230310 VALUES LESS THAN (738955) ENGINE = InnoDB,  PARTITION P20230311 VALUES LESS THAN (738956) ENGINE = InnoDB,  PARTITION P20230312 VALUES LESS THAN (738957) ENGINE = InnoDB, PARTITION P20230313 VALUES LESS THAN (738958) ENGINE = InnoDB, PARTITION P20230314 VALUES LESS THAN (738959) ENGINE = InnoDB, PARTITION P20230315 VALUES LESS THAN (738960) ENGINE = InnoDB, PARTITION P20230316 VALUES LESS THAN (738961) ENGINE = InnoDB, PARTITION P20230317 VALUES LESS THAN (738962) ENGINE = InnoDB,  PARTITION P20230318 VALUES LESS THAN (738963) ENGINE = InnoDB, PARTITION P20230319 VALUES LESS THAN (738964) ENGINE = InnoDB, PARTITION P20230320 VALUES LESS THAN (738965) ENGINE = InnoDB,  PARTITION P20230321 VALUES LESS THAN (738966) ENGINE = InnoDB,   PARTITION P20230322 VALUES LESS THAN (738967) ENGINE = InnoDB, PARTITION P20230323 VALUES LESS THAN (738968) ENGINE = InnoDB, PARTITION P20230324 VALUES LESS THAN (738969) ENGINE = InnoDB, PARTITION P20230325 VALUES LESS THAN (738970) ENGINE = InnoDB, PARTITION P20230326 VALUES LESS THAN (738971) ENGINE = InnoDB, PARTITION P20230327 VALUES LESS THAN (738972) ENGINE = InnoDB,  PARTITION P20230328 VALUES LESS THAN (738973) ENGINE = InnoDB, PARTITION p20230329 VALUES LESS THAN (738974) ENGINE = InnoDB, PARTITION p20230330 VALUES LESS THAN (738975) ENGINE = InnoDB, PARTITION p20230331 VALUES LESS THAN (738976) ENGINE = InnoDB, PARTITION p20230401 VALUES LESS THAN (738977) ENGINE = InnoDB, PARTITION p20230402 VALUES LESS THAN (738978) ENGINE = InnoDB, PARTITION p20230403 VALUES LESS THAN (738979) ENGINE = InnoDB);

插入数据

greatsql> insert into t_partition(ua,start_time) select substring(md5(rand()),1,20),'2023-03-10' ;Query OK, 1 row affected (0.01 sec)Records: 1  Duplicates: 0  Warnings: 0greatsql> insert into t_partition(ua,start_time) select substring(md5(rand()),1,20),'2023-03-10' ;Query OK, 1 row affected (0.00 sec)Records: 1  Duplicates: 0  Warnings: 0greatsql> insert into t_partition(ua,start_time) select substring(md5(rand()),1,20),'2023-03-10' ;Query OK, 1 row affected (0.01 sec)Records: 1  Duplicates: 0  Warnings: 0greatsql> insert into t_partition(ua,start_time) select substring(md5(rand()),1,20),start_time from t_partition; Query OK, 3 rows affected (0.01 sec)Records: 3  Duplicates: 0  Warnings: 0greatsql> insert into t_partition(ua,start_time) select substring(md5(rand()),1,20),start_time from t_partition; Query OK, 6 rows affected (0.00 sec)Records: 6  Duplicates: 0  Warnings: 0............... greatsql> insert into t_partition(ua,start_time) select substring(md5(rand()),1,20),start_time from t_partition; Query OK, 3145728 rows affected (35.68 sec)Records: 3145728  Duplicates: 0  Warnings: 0greatsql> insert into t_partition(ua,start_time) select substring(md5(rand()),1,20),start_time from t_partition; Query OK, 6291456 rows affected (1 min 11.51 sec)Records: 6291456  Duplicates: 0  Warnings: 0greatsql> insert into t_partition(ua,start_time) select substring(md5(rand()),1,20),start_time from t_partition; Query OK, 12582912 rows affected (2 min 26.74 sec)Records: 12582912  Duplicates: 0  Warnings: 0greatsql> select count(*) from t_partition; +----------+| count(*) |+----------+| 25165824 |+----------+1 row in set (0.50 sec)greatsql> select count(*) from t_partition partition(P20230310); +----------+| count(*) |+----------+| 25165824 |+----------+

向分区插入数据

insert into t_partition(ua,start_time) select substring(md5(rand()),1,20),'2023-03-11' from t_partition partition(P20230310);  insert into t_partition(ua,start_time) select substring(md5(rand()),1,20),'2023-03-12' from t_partition partition(P20230310);  insert into t_partition(ua,start_time) select substring(md5(rand()),1,20),'2023-03-13' from t_partition partition(P20230310);  insert into t_partition(ua,start_time) select substring(md5(rand()),1,20),'2023-03-14' from t_partition partition(P20230310);  insert into t_partition(ua,start_time) select substring(md5(rand()),1,20),'2023-03-15' from t_partition partition(P20230310);  insert into t_partition(ua,start_time) select substring(md5(rand()),1,20),'2023-03-16' from t_partition partition(P20230310);  insert into t_partition(ua,start_time) select substring(md5(rand()),1,20),'2023-03-17' from t_partition partition(P20230310);  insert into t_partition(ua,start_time) select substring(md5(rand()),1,20),'2023-03-18' from t_partition partition(P20230310);  insert into t_partition(ua,start_time) select substring(md5(rand()),1,20),'2023-03-19' from t_partition partition(P20230310);  insert into t_partition(ua,start_time) select substring(md5(rand()),1,20),'2023-03-20' from t_partition partition(P20230310);  insert into t_partition(ua,start_time) select substring(md5(rand()),1,20),'2023-03-21' from t_partition partition(P20230310);   。。。。。。。。。。。greatsql>  insert into t_partition(ua,start_time) select substring(md5(rand()),1,20),'2023-03-19' from t_partition partition(P20230310); Query OK, 25165824 rows affected (5 min 17.92 sec)Records: 25165824  Duplicates: 0  Warnings: 0greatsql>  insert into t_partition(ua,start_time) select substring(md5(rand()),1,20),'2023-03-20' from t_partition partition(P20230310); Query OK, 25165824 rows affected (5 min 19.56 sec)Records: 25165824  Duplicates: 0  Warnings: 0greatsql>  insert into t_partition(ua,start_time) select substring(md5(rand()),1,20),'2023-03-21' from t_partition partition(P20230310); Query OK, 25165824 rows affected (5 min 27.27 sec)Records: 25165824  Duplicates: 0  Warnings: 0

更新数据

greatsql> Update t_partition set ua=concat(ua,'abc') where start_time='2023-03-19';Query OK, 25165824 rows affected (12 min 55.53 sec)Rows matched: 25165824  Changed: 25165824  Warnings: 0

删除分区s

greatsql> alter table t_partition drop partition P20230311;Query OK, 0 rows affected (13.68 sec)Records: 0 Duplicates: 0 Warnings: 0greatsql> alter table t_partition drop partition P20230312;Query OK, 0 rows affected (0.07 sec)Records: 0 Duplicates: 0 Warnings: 0

两个分区数据量是一样，但删除第一个分区耗时较长。

$ perf record -ag -p  11222  -o /mysqldb/perf_drop_part_mysql2.dataWarning:PID/TID switch overriding SYSTEM^C[ perf record: Woken up 41 times to write data ][ perf record: Captured and wrote 10.610 MB /mysqldb/perf_drop_part_mysql2.data (54351 samples) ]  Children      Self  Command  Shared Object        Symbol                                                   +   99.17%     0.00%  mysqld   libpthread-2.17.so   [.] start_thread                                         +   99.17%     0.00%  mysqld   mysqld               [.] pfs_spawn_thread                                     +   99.17%     0.00%  mysqld   mysqld               [.] handle_connection                                    +   99.17%     0.00%  mysqld   mysqld               [.] do_command                                           +   99.17%     0.00%  mysqld   mysqld               [.] dispatch_command                                     +   99.17%     0.00%  mysqld   mysqld               [.] dispatch_sql_command                                 +   99.16%     0.00%  mysqld   mysqld               [.] mysql_execute_command                                +   99.16%     0.00%  mysqld   mysqld               [.] Sql_cmd_alter_table::execute                         +   99.16%     0.00%  mysqld   mysqld               [.] mysql_alter_table                                    +   99.09%     0.00%  mysqld   mysqld               [.] mysql_inplace_alter_table                            +   98.56%     0.00%  mysqld   mysqld               [.] ha_innopart::commit_inplace_alter_partition          +   98.54%     0.00%  mysqld   mysqld               [.] alter_parts::prepare_or_commit_for_new               +   98.54%     0.00%  mysqld   mysqld               [.] alter_part_normal::try_commit                        +   98.53%     0.00%  mysqld   mysqld               [.] btr_drop_ahi_for_table                               +   98.52%     1.30%  mysqld   mysqld               [.] btr_drop_next_batch                                  +   97.20%     0.03%  mysqld   mysqld               [.] btr_search_drop_page_hash_when_freed                 +   96.34%     2.03%  mysqld   mysqld               [.] btr_search_drop_page_hash_index                      +   86.52%    52.25%  mysqld   mysqld               [.] ha_remove_all_nodes_to_page                          +   34.21%    34.15%  mysqld   mysqld               [.] ha_delete_hash_node                                  +    4.27%     2.68%  mysqld   mysqld               [.] rec_get_offsets_func                                 +    2.11%     2.10%  mysqld   mysqld               [.] ut_fold_binary                                       +    1.58%     1.58%  mysqld   mysqld               [.] rec_init_offsets                                     +    1.30%     1.30%  mysqld   mysqld               [.] rec_fold                                             +    0.57%     0.03%  mysqld   mysqld               [.] buf_page_get_gen                                     +    0.57%     0.00%  mysqld   mysqld               [.] execute_native_thread_routine                        +    0.56%     0.01%  mysqld   [kernel.kallsyms]    [k] system_call_fastpath

从零碎调用上看有大量的自适应hash相干的调用

重启敞开自适应hash

greatsql> show variables like '%hash%';+----------------------------------+-------+| Variable_name                    | Value |+----------------------------------+-------+| innodb_adaptive_hash_index       | OFF   || innodb_adaptive_hash_index_parts | 8     |+----------------------------------+-------+

批改配置文件，敞开自适应hash，依照下面的流程从新执行

greatsql> alter table t_partition drop partition P20230311;Query OK, 0 rows affected (0.08 sec)Records: 0 Duplicates: 0 Warnings: 0greatsql> alter table t_partition drop partition P20230312;Query OK, 0 rows affected (0.08 sec)Records: 0 Duplicates: 0 Warnings: 0

敞开自适应hash后，雷同的操作过程，删除第一个分区的工夫显著变短，删除每个分区的工夫基本上统一。

备注：innodb_adaptive_hash_index是全局变量，能够动静批改，不重启数据库。

测试后果汇总

自适应hash比照	第一次删分区	第二次删分区
innodb_buffer_pool_instances=8& innodb_adaptive_hash_index=on	13.68	0.07
innodb_buffer_pool_instances=8& innodb_adaptive_hash_index=off	0.08	0.08

源码剖析

// btr_drop_ahi_for_tablevoid btr_drop_ahi_for_table(dict_table_t *table) {  const ulint len = UT_LIST_GET_LEN(table->indexes);  if (len == 0) {    return;  }  const dict_index_t *indexes[MAX_INDEXES];  const page_size_t page_size(dict_table_page_size(table));  for (;;) {    ulint ref_count = 0;    const dict_index_t **end = indexes;    for (dict_index_t *index = table->first_index(); index != nullptr;         index = index->next()) {      if (ulint n_refs = index->search_info->ref_count) {        ut_ad(!index->disable_ahi);        ut_ad(index->is_committed());        ref_count += n_refs;        ut_ad(indexes + len > end);        *end++ = index;      }    }    ut_ad((indexes == end) == (ref_count == 0));    if (ref_count == 0) {      return;    }    btr_drop_next_batch(page_size, indexes, end);  // breakpoint      std::this_thread::yield();  }}// btr_drop_next_batchstatic void btr_drop_next_batch(const page_size_t &page_size,                                const dict_index_t **first,                                const dict_index_t **last) {  static constexpr unsigned batch_size = 1024;  std::vector<page_id_t> to_drop;  to_drop.reserve(batch_size);  for (ulint i = 0; i < srv_buf_pool_instances; ++i) {    to_drop.clear();    buf_pool_t *buf_pool = buf_pool_from_array(i);    mutex_enter(&buf_pool->LRU_list_mutex);    const buf_page_t *prev;    for (const buf_page_t *bpage = UT_LIST_GET_LAST(buf_pool->LRU);         bpage != nullptr; bpage = prev) {      prev = UT_LIST_GET_PREV(LRU, bpage);      ut_a(buf_page_in_file(bpage));      if (buf_page_get_state(bpage) != BUF_BLOCK_FILE_PAGE ||          bpage->buf_fix_count > 0) {        continue;      }      const dict_index_t *block_index =          reinterpret_cast<const buf_block_t *>(bpage)->ahi.index;      /* index == nullptr means the page is no longer in AHI, so no need to      attempt freeing it */      if (block_index == nullptr) {        continue;      }      /* pages IO fixed for read have index == nullptr */      ut_ad(!bpage->was_io_fix_read());      if (std::find(first, last, block_index) != last) {        to_drop.emplace_back(bpage->id);        if (to_drop.size() == batch_size) {   // batch_size = 1024          break;        }      }    }    mutex_exit(&buf_pool->LRU_list_mutex);    for (const page_id_t &page_id : to_drop) {      btr_search_drop_page_hash_when_freed(page_id, page_size); // breakpoint    }  }}// btr_search_drop_page_hash_when_freedvoid btr_search_drop_page_hash_when_freed(const page_id_t &page_id,                                          const page_size_t &page_size) {  buf_block_t *block;  mtr_t mtr;  ut_d(export_vars.innodb_ahi_drop_lookups++);  mtr_start(&mtr);  /* If the caller has a latch on the page, then the caller must  have a x-latch on the page and it must have already dropped  the hash index for the page. Because of the x-latch that we  are possibly holding, we cannot s-latch the page, but must  (recursively) x-latch it, even though we are only reading. */  block = buf_page_get_gen(page_id, page_size, RW_X_LATCH, nullptr,                           Page_fetch::PEEK_IF_IN_POOL, UT_LOCATION_HERE, &mtr);  if (block) {    /* If AHI is still valid, page can't be in free state.    AHI is dropped when page is freed. */    ut_ad(!block->page.file_page_was_freed);    buf_block_dbg_add_level(block, SYNC_TREE_NODE_FROM_HASH);    dict_index_t *index = block->ahi.index;    if (index != nullptr) {      /* In all our callers, the table handle should      be open, or we should be in the process of      dropping the table (preventing eviction). */      ut_ad(index->table->n_ref_count > 0 || dict_sys_mutex_own());      btr_search_drop_page_hash_index(block);    }  }  mtr_commit(&mtr);}

函数逻辑阐明：

drop 分区和add分区都会清空所有分区的AHI信息，最耗时的如下循环每个分区调用函数->btr_drop_ahi_for_table  循环表（或者分区）中的每个索引，如果索引都没有用到AHI，则退出  循环innodb buffer中的每个实例，依据LRU链表循环每个page如果page建设了AHI信息，且是要删除表（或者分区）的相干索引  则放入drop vector容器中如果page没有建设AHI信息 则跳过如果drop verctor容器中填满1024个page则清理一次，循环每个page，调用函数->btr_search_drop_page_hash_index  计算page所在AHI构造的slot信息，以便找到对应的hash_table_t构造  循环page中所有的行    循环行中拜访到的索引字段（拜访模式），计算出fold信息填入到fold[]数组中    本循环中会通过函数rec_get_offsets进行字段偏移量的获取，为耗用CPU的函数      循环fold[]数组，一个fold代表一行数据,调用函数       ->ha_remove_all_nodes_to_page,为耗用CPU的函数         ->ha_chain_get_fist           依据fold信息找到hash构造的cell         循环本cell中的链表信息           如果行的地址在本要删除的page上，调用函数           ->ha_delete_hash_node，为耗费CPU的函数             进行链表和hash构造的保护每次解决完1024个page后，yeild线程被动放弃CPU，防止长期占用CPU，醒来后持续解决

drop 分区和add分区都会循环每个分区调用函数btr_drop_ahi_for_table 、btr_search_drop_page_hash_index清空所有分区及索引的的AHI信息，删除第1个分区的时ahi信息被清空，删第2个分区的时候buffer中曾经没有ahi信息了，所有删除第2个分区就很快了。

防止形式

针对以上起因，线上执行 drop必须恪守以下准则：

1、敞开AHI性能，不应用AHI带来的查问减速性能，须要先在测试环境进行业务测试，确保业务性能不受影响。

2、删除表的第一个分区时，外部会清理该表在每个buffer pool实例中对应的数据块页面，耗时较久，接着删其余分区耗时很小，倡议将每天一次的删除分区的操作改为每周或者每月批量执行删除分区的操作，并且须要在业务低峰期操作。

Enjoy GreatSQL :)

## 对于 GreatSQL

GreatSQL是由万里数据库保护的MySQL分支，专一于晋升MGR可靠性及性能，反对InnoDB并行查问个性，是实用于金融级利用的MySQL分支版本。

相干链接： GreatSQL社区 Gitee GitHub Bilibili

GreatSQL社区：

社区博客有奖征稿详情：https://greatsql.cn/thread-100-1-1.html

技术交换群：

微信：扫码增加GreatSQL社区助手微信好友，发送验证信息加群。