关于mysql:第25期索引设计索引的基数与可选择性

这篇次要介绍 MySQL 索引的 Cardinality 值（基数）以及索引的可选择性。

索引基数值

索引基数的含意：

由索引中惟一值计算的一个预估值。索引基数值的精确水平间接影响到 MySQL 优化器基于此索引的查问打算是否精确高效。

与索引基数值最为亲密的典型场景就是：一条 SQL 在某一时刻执行比较慢，其中较为可能的起因就是以后表记录更新频繁，这条 SQL 执行打算走的索引基数值没及时更新，优化器抉择走备用索引或者走全表扫描，从而非最优执行打算，最终执行后果没有达到预期，总体查问工夫较慢，这时可能得手工更新索引的基数值。

索引的可选择性：

索引的可选择性好与坏，和索引基数关系十分亲密。基数值越高，索引的可选择性越好；相同，基数越低，索引的可选择性越差。优化器优先应用的索引个别选择性都不差，除非没得选，才会走选择性稍差点的索引或者走全表扫描。

影响索引基数值的相干指标：

表的 sample page 个数，也就是表样例数据页的个数，这个在之前表样例数据计算中具体讲过。
表的更新频率，一般来说，当 1/16 的数据页被更新过，就会自动更新索引基数。
索引列的数据分布水平，比方状态类，订单号，日期等。不同的数据分布，有不同的索引基数。
手动进行索引基数更新，比方 analyze table、show table status 等。

查看某个索引的基数值，有多种形式：

间接执行 show index from tablename。
查问数据字典表 information_schema.statstics。

举例

上面来举例说明索引基数在不同的数据分布场景下的变动以及对优化器的影响。

根底表构造如下：表 ytt_sample 有 7 个字段，5 个索引，其中主键的基数最大，可选择性最好，其余的索引要看数据的散布情况来定。

(localhost:mysqld.sock)|(ytt)>show create table ytt_sample\G*************************** 1. row ***************************       Table: ytt_sampleCreate Table: CREATE TABLE `ytt_sample` (  `id` int NOT NULL AUTO_INCREMENT,  `r1` int DEFAULT NULL,  `r2` int DEFAULT NULL,  `r3` int DEFAULT NULL,  `r4` int DEFAULT NULL,  `r5` tinyint DEFAULT NULL,  `r6` date DEFAULT NULL,  PRIMARY KEY (`id`),  KEY `idx_u1` (`r1`,`r2`,`r3`),  KEY `idx_r4` (`r4`),  KEY `idx_r5` (`r5`),  KEY `idx_r6` (`r6`)) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci1 row in set (0.00 sec)

这张表有 100 行记录。

(localhost:mysqld.sock)|(ytt)>select count(*) from ytt_sample;+----------+| count(*) |+----------+|      100 |+----------+1 row in set (0.03 sec)

检索数据字典，查看以后表索引基数排名：

(localhost:mysqld.sock)|(ytt)>select table_schema,table_name,index_name,cardinality from information_schema.statistics where table_schema='ytt' and table_name = 'ytt_sample' order by cardinality desc;+--------------+------------+------------+-------------+| TABLE_SCHEMA | TABLE_NAME | INDEX_NAME | CARDINALITY |+--------------+------------+------------+-------------+| ytt          | ytt_sample | PRIMARY    |         100 || ytt          | ytt_sample | idx_u1     |          97 || ytt          | ytt_sample | idx_u1     |          91 || ytt          | ytt_sample | idx_r6     |          85 || ytt          | ytt_sample | idx_u1     |          29 || ytt          | ytt_sample | idx_r4     |          20 || ytt          | ytt_sample | idx_r5     |          10 |+--------------+------------+------------+-------------+7 rows in set (0.02 sec)

从以上后果能够看出，主键基数最高，极限靠近于表记录数；联结索引 idx_u1 次之；索引 idx_r6 值为 85，也不差；比拟差的为 idx_r4、idx_r5，别离为 20、10，其中 idx_r5 最差，仅仅为表记录熟的 1/10。索引 idx_r5 相似于咱们常说的状态类索引，因为所以基数很低，优化器个别不抉择这个索引，个别不须要加，加了反而影响表的写性能。

联结索引 idx_u1 的基数是依照多个键值顺次组合计算，别离为(r1),(r1,r2),(r1,r2,r3)，以下为计算形式：

(localhost:mysqld.sock)|(ytt)>select count(distinct r1) idx_u1 from ytt_sample    -> union all    -> select count(distinct r1,r2) idx_u1 from ytt_sample    -> union all    -> select count(distinct r1,r2,r3) idx_u1 from ytt_sample;+--------+| idx_u1 |+--------+|     29 ||     91 ||     97 |+--------+3 rows in set (0.00 sec)

能够看出，联结索引的基数是 97,前两个字段的基数别离为 29 和 91。

接下来，用索引 idx_r4、idx_r5 举个例子，以下 SQL 1：

# SQL 1select * from ytt_sample where  r4 = 2 and r5 = 2;

以上 SQL 1 如果运行十分频繁，那就应该删掉两个单值索引 idx_r4、idx_r5，建设一个蕴含这两个字段的联结索引，不过在建设联结索引时，字段的程序十分重要。

独自查问 r4=2 的后果，有 4 条记录。

(localhost:mysqld.sock)|(ytt)>select * from ytt_sample where  r4 = 2 ;+-----+------+------+------+------+------+------------+| id  | r1   | r2   | r3   | r4   | r5   | r6         |+-----+------+------+------+------+------+------------+|   1 |    8 |   25 |   10 |    2 |    2 | 2021-03-12 ||  82 |    8 |   18 |    4 |    2 |    2 | 2021-04-30 ||  85 |    9 |    7 |    9 |    2 |    9 | 2021-03-12 || 135 |   22 |    6 |   23 |    2 |    4 | 2020-12-20 |+-----+------+------+------+------+------+------------+4 rows in set (0.00 sec)

独自查问 r5=2 的后果，有 11 条记录。

(localhost:mysqld.sock)|(ytt)>select * from ytt_sample where  r5 = 2 ;+-----+------+------+------+------+------+------------+| id  | r1   | r2   | r3   | r4   | r5   | r6         |+-----+------+------+------+------+------+------------+|   1 |    8 |   25 |   10 |    2 |    2 | 2021-03-12 ||  19 |   17 |   21 |   22 |    9 |    2 | 2020-10-11 ||  36 |   26 |   27 |   28 |   11 |    2 | 2020-10-29 ||  37 |   28 |   25 |   10 |   16 |    2 | 2020-10-21 ||  41 |   23 |   16 |   12 |   11 |    2 | 2021-01-29 ||  82 |    8 |   18 |    4 |    2 |    2 | 2021-04-30 || 131 |   26 |    9 |   27 |    9 |    2 | 2021-03-18 || 136 |    6 |   22 |    2 |    3 |    2 | 2020-08-20 || 143 |    8 |   25 |   10 |   19 |    2 | 2020-09-01 || 147 |   18 |   18 |    5 |   20 |    2 | 2020-10-10 || 156 |    8 |   13 |    8 |   17 |    2 | 2021-06-13 |+-----+------+------+------+------+------+------------+11 rows in set (0.00 sec)

所以这时程序曾经敲定了，数据起码的在后面，也就是字段 r4 在后面，联结索引应该这样：idx_u2(r4,r5)

此时最优的查问打算如下：

(localhost:mysqld.sock)|(ytt)>explain select * from ytt_sample where  r4 = 2 and r5 = 2\G*************************** 1. row ***************************           id: 1  select_type: SIMPLE        table: ytt_sample   partitions: NULL         type: refpossible_keys: idx_u2          key: idx_u2      key_len: 7          ref: const,const         rows: 2     filtered: 100.00        Extra: NULL1 row in set, 1 warning (0.00 sec)

那是不是只有建设联结索引 idx_u2，就一劳永逸了呢？也不肯定，如果前期字段 r4,r5 的数据分布有变动，可能索引 idx_u2 的程序也得改。所以说索引的建设和数据的散布关系十分大，也就是和索引的基数关系十分大。

那再看两条基于字段 r6 的 SQL 语句：

# SQL 2select * from ytt_sample where r6 between '2020-01-25' and '2021-03-12'# SQL 3select count(*) from ytt_sample where r6 between '2020-01-25' and '2021-03-12'

下面 SQL 2、SQL 3 两条 SQL 的过滤条件一样，都是一个范畴。不同的是 SQL 2 打印合乎过滤条件的记录，而 SQL 3 是打印符号过滤条件的记录条数。单从过滤条件来看，这两条 SQL 的执行打算应该一样，那别离看下两条 SQL 的执行打算：

# EXPLAIN SQL 2(localhost:mysqld.sock)|(ytt)>explain select * from ytt_sample where r6 between '2020-01-25' and '2021-03-12'\G*************************** 1. row ***************************           id: 1  select_type: SIMPLE        table: ytt_sample   partitions: NULL         type: ALLpossible_keys: idx_r6          key: NULL      key_len: NULL          ref: NULL         rows: 100     filtered: 73.00        Extra: Using where1 row in set, 1 warning (0.00 sec)# EXPLAIN SQL 3(localhost:mysqld.sock)|(ytt)>explain select count(*) from ytt_sample where r6 between '2020-01-25' and '2021-03-12'\G*************************** 1. row ***************************           id: 1  select_type: SIMPLE        table: ytt_sample   partitions: NULL         type: rangepossible_keys: idx_r6          key: idx_r6      key_len: 4          ref: NULL         rows: 73     filtered: 100.00        Extra: Using where; Using index1 row in set, 1 warning (0.00 sec)

从执行打算看，SQL 2 没走索引，走全表扫描；而 SQL 3 间接走索引取回记录数，防止了拜访表数据。那为什么两条 SQL 的过滤条件一样，执行打算却不一样呢？

SQL 2 程序全表扫描表数据的速度要比走索引再随机扫描表数据快很多，因为要打印的记录数有 73 条，快靠近表记录数 100 条了。索引 idx_r6 的基数其实很高了，然而因为最终返回的记录数太多，MySQL 只能放弃走索引；而 SQL 3 因为只求合乎过滤条件的记录数，间接从索引入口就能够计算出来后果。

那如果把 SQL 2 的过滤条件膨胀下，再次查看执行打算：

(localhost:mysqld.sock)|(ytt)>explain select * from ytt_sample where r6 between '2020-12-25' and '2021-03-12'\G*************************** 1. row ***************************           id: 1  select_type: SIMPLE        table: ytt_sample   partitions: NULL         type: rangepossible_keys: idx_r6          key: idx_r6      key_len: 4          ref: NULL         rows: 15     filtered: 100.00        Extra: Using index condition1 row in set, 1 warning (0.00 sec)

从执行打算看，间接走了索引。

那之前的 SQL 2 过滤条件要是不变动，是否会用到索引呢？这个就与索引的基数值以及索引基于肯定过滤条件的选择性好坏有很大的关系。比方因为某些业务变动，表 ytt_sample 字段 r6 的数据分布产生了变动（表行数变大，字段 r6 对应范畴的数据收窄），数据变动过程此处省略。

那还是同样的过滤条件：where r6 between '2020-01-25' and '2021-03-12'。

再次看看 SQL 2 的执行打算，很完满的走了索引：

(localhost:mysqld.sock)|(ytt)>explain  select * from ytt_sample where r6 between '2020-01-25' and '2021-03-12'\G*************************** 1. row ***************************           id: 1  select_type: SIMPLE        table: ytt_sample   partitions: NULL         type: rangepossible_keys: idx_r6          key: idx_r6      key_len: 4          ref: NULL         rows: 73     filtered: 100.00        Extra: Using index condition1 row in set, 1 warning (0.00 sec)(localhost:mysqld.sock)|(ytt)>

因为数据分布发生变化，索引基数值也产生了扭转，基于同样过滤条件的索引选择性也从差变好，所以 SQL 2 间接走索引 idx_r6 返回数据。

再次查看此时的索引基数值：

(localhost:mysqld.sock)|(ytt)>select table_schema,table_name,index_name,cardinality from information_schema.statistics where table_schema='ytt' and table_name = 'ytt_sample' order by cardinality desc;+--------------+------------+------------+-------------+| TABLE_SCHEMA | TABLE_NAME | INDEX_NAME | CARDINALITY |+--------------+------------+------------+-------------+| ytt          | ytt_sample | PRIMARY    |         808 || ytt          | ytt_sample | idx_r6     |         503 || ytt          | ytt_sample | idx_u1     |          98 || ytt          | ytt_sample | idx_u1     |          92 || ytt          | ytt_sample | idx_u2     |          79 || ytt          | ytt_sample | idx_r1     |          29 || ytt          | ytt_sample | idx_u1     |          29 || ytt          | ytt_sample | idx_u2     |          20 |+--------------+------------+------------+-------------+8 rows in set (0.00 sec)

索引 idx_r6 的基数由之前的 85 晋升到 500 多。所以即便同样的字段，同样的过滤条件，不同的索引基数值以及基于索引基数值的索引选择性高下的不同，也会让优化器抉择不同的执行打算。