关于mysql:第41期MySQL-哈希分区表

提到分区表，个别依照范畴（range）来对数据拆分居多，以哈希来对数据拆分的场景相来说有肯定局限性，不具备标准化。接下来我用几个示例来讲讲 MySQL 哈希分区表的应用场景以及相干革新点。

对于哈希分区表，最艰深的办法就是 hash 单个字段，比方上面表 hash_t1（存有500W行记录），依照自增 ID 来做 HASH ，分区数目为 1024 ：

mysql:ytt_new> show create table hash_t1\G*************************** 1. row ***************************       Table: hash_t1Create Table: CREATE TABLE `hash_t1` (  `id` bigint unsigned NOT NULL AUTO_INCREMENT,  `r1` int DEFAULT NULL,  `log_date` date DEFAULT NULL,  PRIMARY KEY (`id`)) ENGINE=InnoDB AUTO_INCREMENT=5000001 DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci/*!50100 PARTITION BY HASH (`id`)PARTITIONS 1024 */1 row in set (0.00 sec)mysql:ytt_new> select count(*) from hash_t1;+----------+| count(*) |+----------+|  5000000 |+----------+1 row in set (2.43 sec)

表 hash_t1 的分区形式很好了解，依照 ID 来对数据进行 HASH 拆分，也就是依照分区数量求模，相似于 hash(mod(id,1024)) ，数据分布十分平均。

mysql:ytt_new> select max(table_rows),min(table_rows) from information_schema.partitions where table_name = 'hash_t1';+-----------------+-----------------+| max(table_rows) | min(table_rows) |+-----------------+-----------------+|            4883 |            4882 |+-----------------+-----------------+1 row in set (0.04 sec)

接下来思考以下几条 SQL 语句：

SQL 1：select count(*) from hash_t1 where id = 1;SQL 2：select count(*) from hash_t1 where id in (1,2,3,4,5,6,7,8,9,10,11,12,13,14,15);SQL 3：select count(*) from hash_t1 where  id <=1；SQL 4：select count(*) from hash_t1 where id <=15;

SQL 1 和 SQL 2 非常适合检索哈希分区表，SQL 3 和 SQL 4 就不太适宜。

SQL 1 的执行打算：对于哈希分区表来说为最优场景，能具体到某单个分区，过滤值为常量。

mysql:ytt_new> explain select count(*) from hash_t1 where id = 8\G*************************** 1. row ***************************           id: 1  select_type: SIMPLE        table: hash_t1   partitions: p8         type: constpossible_keys: PRIMARY          key: PRIMARY      key_len: 8          ref: const         rows: 1     filtered: 100.00        Extra: Using index1 row in set, 1 warning (0.00 sec)

SQL 2 针对字段 ID 的过滤条件为15个常量组合，具体到15个分区，比照总分区数来讲比例很小，也很优化。不过从执行打算来看，还有优化空间，能够思考扭转分区表的哈希形式，前面介绍。

SQL 2 的执行打算：

mysql:ytt_new> explain select count(*) from hash_t1 where id  in (1,2,3,4,5,6,7,8,9,10,11,12,13,14,15)\G*************************** 1. row ***************************           id: 1  select_type: SIMPLE        table: hash_t1   partitions: p1,p2,p3,p4,p5,p6,p7,p8,p9,p10,p11,p12,p13,p14,p15         type: rangepossible_keys: PRIMARY          key: PRIMARY      key_len: 8          ref: NULL         rows: 15     filtered: 100.00        Extra: Using where; Using index1 row in set, 1 warning (0.00 sec)

SQL 3 、SQL 4 与 SQL 1 、SQL 2 实现的成果一样，不过却要扫描所有分区能力拿到后果。

来同样看下 SQL 3 执行打算：

mysql:ytt_new> explain select count(*) from hash_t1 where  id <=1\G*************************** 1. row ***************************           id: 1  select_type: SIMPLE        table: hash_t1   partitions: p0,p1,p2，...，p1021,p1022,p1023...

所以须要留神的是哈希分区表仅限于等值过滤检索，相似对一般表基于哈希索引的检索。

之前咱们有看到 SQL 2 扫描了不必要的分区，那是否缩小 SQL 2 扫描的分区数量呢？答案是能够的。

得从新对表数据进行哈希拆分，由需要到定义反着来：

创立一张新表 hash_t2 , 依照 Id div 1024 来分区，每个分区就能严格依照 ID 程序寄存前 1024 个值：

mysql:ytt_new> create table hash_t2 (id bigint unsigned auto_increment primary key, r1 int, log_date date) partition by hash(id div 1024) partitions 1024;Query OK, 0 rows affected (10.54 sec)mysql:ytt_new>load data infile '/var/lib/mysql-files/hash_sample.csv' into table hash_t2;Query OK, 5000000 rows affected (3 min 20.11 sec)Records: 5000000  Deleted: 0  Skipped: 0  Warnings: 0

来看看成果：此时 SQL 2 能够基于单个分区 p0 来检索数据。

mysql:ytt_new> explain select count(*) from hash_t2 where id in (1,2,3,4,5,6,7,8,9,10,11,12,13,14,15)\G*************************** 1. row ***************************           id: 1  select_type: SIMPLE        table: hash_t2   partitions: p0         type: rangepossible_keys: PRIMARY          key: PRIMARY      key_len: 8          ref: NULL         rows: 15     filtered: 100.00        Extra: Using where; Using index1 row in set, 1 warning (0.00 sec)

此外，哈希分区还适宜特定的日期类等值查问场景，分区定义比依照范畴要简略，实现成果一样。比方依照日期来检索的 SQL 5 ：

SQL 5： select count(*) from hash_t1 where log_date= '2020-08-05';

创立新表 hash_t3 ，分区字段为 year(log_date) ：

mysql:ytt_new>create table hash_t3 (id bigint unsigned , r1 int,log_date date, key idx_log_date(log_date));Query OK, 0 rows affected (0.04 sec)mysql:ytt_new>alter table hash_t3 partition by hash(year(log_date)) partitions 11;Query OK, 0 rows affected (0.32 sec)Records: 0  Duplicates: 0  Warnings: 0mysql:ytt_new>load data infile '/var/lib/mysql-files/hash_sample.csv' into table hash_t3;Query OK, 5000000 rows affected (2 min 4.59 sec)Records: 5000000  Deleted: 0  Skipped: 0  Warnings: 0

定义好新表，来看下 SQL 5 的执行打算：基于日期的检索也能限定在单个分区。

mysql:ytt_new>explain  select count(*) from hash_t3 where log_date = '2020-08-05'\G*************************** 1. row ***************************           id: 1  select_type: SIMPLE        table: hash_t3   partitions: p7         type: rpossible_keys: idx_log_date          key: idx_log_date      key_len: 4          ref: const         rows: 1405     filtered: 100.00        Extra: Using index1 row in set, 1 warning (0.00 sec)

这里用 year 还不够优化，前期随着数据量减少，每个分区的数据会有增长，能够思考依照 month 来拆分数据。

MySQL 还有一个非凡的哈希分区：不限度输出数据类型的KEY 分区，哈希函数预约义为零碎函数 PASSWORD ，定义更加简略，适宜主键非整型字段的表。

以上这些都只是思考查问性能，如果前期分区常常扩容，缩容等，能够思考线性哈希分区。

线性哈希是一致性哈希在 MySQL 里的具体实现，其目标就是为了解决分区表前期扩缩容性能问题。不过会带来分区数据分布不平均、呈现数据热点、雷同 SQL 扫描记录数被放大等新问题。

用一个简略例子来比照下两者的差别：把表 hash_t1 分区数量缩减到10个，总破费工夫2分钟46秒：

mysql:ytt_new>alter table hash_t1 coalesce partition 1014;Query OK, 0 rows affected (2 min 46.01 sec)Records: 0  Duplicates: 0  Warnings: 0

新建线性哈希表 hash_linear_t1 ，初始分区数目也是 1024 ，同样把分区数目缩小到10个。缩减分区的工夫为1分钟28秒，比操作表 hash_t1 工夫上少了一半左右。

mysql:ytt_new>create table hash_linear_t1 (id bigint unsigned auto_increment primary key, r1 int,log_date date) partition by linear hash(id) partitions 1024;Query OK, 0 rows affected (34.13 sec)mysql:ytt_new>load data infile '/var/lib/mysql-files/hash_sample.csv' into table hash_linear_t1 ;Query OK, 5000000 rows affected (2 min 7.78 sec)Records: 5000000  Deleted: 0  Skipped: 0  Warnings: 0mysql:ytt_new>alter table hash_linear_t1 coalesce partition 1014;Query OK, 0 rows affected (1 min 28.29 sec)Records: 0  Duplicates: 0  Warnings: 0

来看下两张分区表的数据分布状况：很显著，线性哈希表数据分布不是很平均，存在重大的数据热点问题。

mysql:ytt_new>select table_rows from information_schema.partitions where table_name = 'hash_t1';+------------+| TABLE_ROWS |+------------+|     485723 ||     537704 ||     523017 ||     470724 ||     478982 ||     512272 ||     483190 ||     455829 ||     520512 ||     461572 |+------------+10 rows in set (0.00 sec)mysql:ytt_new>select table_rows from information_schema.partitions where table_name = 'hash_linear_t1 ';+------------+| TABLE_ROWS |+------------+|     269443 ||     340989 ||     611739 ||     584321 ||     566181 ||     624040 ||     637801 ||     688467 ||     331397 ||     317695 |+------------+10 rows in set (0.01 sec)

本篇介绍了 MySQL 哈希分区表的应用场景以及一些轻微差别。切记：哈希分区不能用于范畴查问，只能用作等值查问场景。