关于后端:MySQL千万数据查询优化之路

本文次要针对 MySQL 在千万级别数据的分页查问性能进行优化, 上面是整个优化的过程.

先说论断, MySQL 在千万级别数据的分页查问性能次要受到 2 个因素的影响:

查问的偏移量
查问的数据量

当 MySQL 执行查问语句分页 LIMIT 时, 有 2 个步骤须要先依照指定的排序规定对数据进行排序, 而后跳过指定的偏移量。

如果查问的偏移量比拟大, 那么排序的工夫就会比拟长 (B+ 树索引能够极大优化该阶段性能)

然而 B+ 树在跳过指定的偏移量时, 须要程序遍历, O(n) 的复杂度, 千万级的偏移量也是比较慢

优化思路:

给排序的字段加上 B + 树索引
应用子查问确定查问范畴 (比方, 主键的范畴, BETWEEN 等 )
连表查问, 小表驱动大表, 通过小表的索引来确定大表的范畴, 缩小偏移量

指定列代替 SELECT *
缩小不须要的列, 特地是大字段
一次尽可能按需查问较少的数据条数
缓存查问后果 (比方 redis) 来缩小查问次数

CREATE TABLE `big_tables` (
  `id` bigint unsigned NOT NULL AUTO_INCREMENT,
  `name` varchar(256) DEFAULT NULL,
  `age` bigint DEFAULT NULL,
  `data` longblob,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=10010001 DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci;

我应用的是 golang gorm 来导入数据, 代码如下:

var (
    configFile = "../config-dev.yaml"
    DB         *gorm.DB
)

type BigTable struct {
    ID   uint64 `gorm:"column:id;primary_key;auto_increment"`
    Name string `gorm:"column:name"`
    Age  int    `gorm:"column:age"`
    Data []byte `gorm:"column:data"`}

func TestBitTable_InsertData(t *testing.T) {
    var err error

    DB.AutoMigrate(&BigTable{})

    // 敞开日志
    DB.Logger = logger.Default.LogMode(logger.Silent)

    // 批量插入 1000w 条数据, 每次插入 10000 条
    batches := 1000
    size := 10_000
    bigTables := make([]*BigTable, 0, size)
    for i := 0; i < batches; i++ {
        for i := 0; i < size; i++ {
            bigTables = append(bigTables, &BigTable{Name: utils.RandString(10),
                Age:  utils.RandInt(10),
                Data: utils.RandBytes(10),
            })
        }
        var task = func(idx int, db *gorm.DB) {err = db.CreateInBatches(bigTables, size).Error
            if err != nil {t.Error(err)
            }
            log.Printf("批次: %v, 实现 \n", idx)
        }

        task(i, DB)

        // 清空
        bigTables = bigTables[:0]
    }

    log.Printf("\n 插入实现 \n")
}

一分钟左右就能够导入 1000w 条数据

mysql> select count(*) from big_tables;
+----------+
| count(*) |
+----------+
| 10010000 |
+----------+
1 row in set (1.27 sec)

命令

select * from big_tables limit 10000, 1;
select * from big_tables limit 100000, 1;
select * from big_tables limit 1000000, 1;
select * from big_tables limit 10000000, 1;

后果

mysql> select * from big_tables limit 10000, 1;
+-------+------------+-----+------------+
| id    | name       | age | data       |
+-------+------------+-----+------------+
| 10001 | I6pC5NBFD9 |   7 | x4zXHhnPnW |
+-------+------------+-----+------------+
1 row in set (0.10 sec)

mysql> select * from big_tables limit 100000, 1;
+--------+------------+-----+------------+
| id     | name       | age | data       |
+--------+------------+-----+------------+
| 100001 | PzpzEZDX9G |   0 | B48IvBLlWo |
+--------+------------+-----+------------+
1 row in set (0.13 sec)

mysql> select * from big_tables limit 1000000, 1;
+---------+------------+-----+------------+
| id      | name       | age | data       |
+---------+------------+-----+------------+
| 1000001 | 4niiNSTHtx |   5 | tdCK9VuVWJ |
+---------+------------+-----+------------+
1 row in set (0.52 sec)

mysql> select * from big_tables limit 10000000, 1;
+----------+------------+-----+------------+
| id       | name       | age | data       |
+----------+------------+-----+------------+
| 10000001 | R0HWlAyf7R |   0 | kHDTpsmtcg |
+----------+------------+-----+------------+
1 row in set (5.86 sec)

表格:

偏移量	查问工夫
1w	0.10s
10w	0.13s
100w	0.52s
1000w	5.86s

能够看到 1w 到 10w 的查问工夫根本不变, 然而 100w 到 1000w 的查问工夫根本线性增长

因为 B+ 树在跳过指定的偏移量时, 须要程序遍历, O(n) 的复杂度

命令

select * from big_tables limit 100000, 10;
select * from big_tables limit 100000, 100;
select * from big_tables limit 100000, 1000;
select * from big_tables limit 100000, 10000;

后果

mysql> select * from big_tables limit 100000, 10;
# 数据太多, 省略
10 rows in set (0.21 sec)

mysql> select * from big_tables limit 100000, 100;
# 数据太多, 省略
100 rows in set (0.35 sec)

mysql> select * from big_tables limit 100000, 1000;
# 数据太多, 省略
1000 rows in set (1.93 sec)

mysql> select * from big_tables limit 100000, 10000;
# 数据太多, 省略
10000 rows in set (21.20 sec)

表格

数据量	查问工夫
10	0.21s
100	0.35s
1000	1.93s
10000	21.20s

能够看到, 数据量越大, 查问工夫越长. 数据 1000-10000 的查问工夫根本线性增长 (这里我的 MySQL 就在本机上, 如果是近程 MySQL 网络 IO 产生的工夫将更长)

然而个别查问的数据量不会太大, 个别都是 10 条左右

优化计划如下:

指定列代替 SELECT *
缩小不须要的列, 特地是大字段
一次尽可能按需查问较少的数据条数
缓存查问后果 (比方 redis) 来缩小查问次数

优化计划比较简单, 容易了解, 前面就不再赘述了

1. 子查问

先查问 id 的地位, 而后再依据 id 的地位查问数据

命令

select * from big_tables where id >= (select id from big_tables limit 10000000, 1) limit 0, 1;

后果

mysql> select * from big_tables where id >= (select id from big_tables limit 10000000, 1) limit 0, 1;
+----------+------------+-----+------------+
| id       | name       | age | data       |
+----------+------------+-----+------------+
| 10000001 | R0HWlAyf7R |   0 | kHDTpsmtcg |
+----------+------------+-----+------------+
1 row in set (2.69 sec)

表格

是否应用子查问	偏移量	查问工夫
是	1000w	2.69s
否	1000w	5.86s

能够看到, 应用子查问后, 查问工夫缩小了 50% 以上.

然而还是在秒级别, 达不到毫秒级别的业务需要

子查问命令

explain select * from big_tables where id >= (select id from big_tables limit 10000000, 1) limit 0, 1;

后果

mysql> explain select * from big_tables where id >= (select id from big_tables limit 10000000, 1) limit 0, 1;
+----+-------------+------------+------------+-------+---------------+---------+---------+------+---------+----------+-------------+
| id | select_type | table      | partitions | type  | possible_keys | key     | key_len | ref  | rows    | filtered | Extra       |
+----+-------------+------------+------------+-------+---------------+---------+---------+------+---------+----------+-------------+
|  1 | PRIMARY     | big_tables | NULL       | range | PRIMARY       | PRIMARY | 8       | NULL |   19290 |   100.00 | Using where |
|  2 | SUBQUERY    | big_tables | NULL       | index | NULL          | PRIMARY | 8       | NULL | 9750719 |   100.00 | Using index |
+----+-------------+------------+------------+-------+---------------+---------+---------+------+---------+----------+-------------+
2 rows in set (3.20 sec)

子查问 (第二行):

子查问的 type 是 index, 示意应用了索引扫描全表
子查问的 key 是 PRIMARY, 示意应用了主键索引
子查问的 rows 是 9750719, 示意扫描了 9750719 行数据 (粗略计算的, 因为 MySQL 每页 16KB)
子查问的 Extra 是 Using index, 示意应用了笼罩索引

主查问 (第一行):

主查问 type 是 range, 示意应用了索引范畴扫描
主查问 key 是 PRIMARY, 示意应用了主键索引
主查问 rows 是 19290, 示意扫描了 19290 行数据
主查问 Extra 是 Using where, 示意应用了 where 条件

从下面能够看出: 子查问应用了索引扫描全表, 扫描行数量千万级别, 所以查问工夫很长

给主键加上 B+ 树的惟一索引

命令

# add unique index
alter table big_tables add unique index id(id) using btree;
# query2
select * from big_tables where id >= (select id from big_tables limit 10000000, 1) limit 0, 1;
# explain query2
explain select * from big_tables where id >= (select id from big_tables limit 10000000, 1) limit 0, 1;

后果

mysql> # add unique index
alter table big_tables add unique index id(id) using btree;
Query OK, 0 rows affected (35.82 sec)
Records: 0  Duplicates: 0  Warnings: 0

mysql> # query2
select * from big_tables where id >= (select id from big_tables limit 10000000, 1) limit 0, 1;
+----------+------------+-----+------------+
| id       | name       | age | data       |
+----------+------------+-----+------------+
| 10000001 | R0HWlAyf7R |   0 | kHDTpsmtcg |
+----------+------------+-----+------------+
1 row in set (1.25 sec)
mysql> # explain query2
explain select * from big_tables where id >= (select id from big_tables limit 10000000, 1) limit 0, 1;
+----+-------------+------------+------------+-------+---------------+---------+---------+------+---------+----------+-------------+
| id | select_type | table      | partitions | type  | possible_keys | key     | key_len | ref  | rows    | filtered | Extra       |
+----+-------------+------------+------------+-------+---------------+---------+---------+------+---------+----------+-------------+
|  1 | PRIMARY     | big_tables | NULL       | range | PRIMARY,id    | PRIMARY | 8       | NULL |   19290 |   100.00 | Using where |
|  2 | SUBQUERY    | big_tables | NULL       | index | NULL          | id      | 8       | NULL | 9750719 |   100.00 | Using index |
+----+-------------+------------+------------+-------+---------------+---------+---------+------+---------+----------+-------------+
2 rows in set (1.47 sec)

表格

状态	偏移量	查问工夫
不必子查问	1000w	5.86s
子查问	1000w	2.69s
子查问 + 惟一索引	1000w	1.25s

能够看到, 给主键加上惟一索引后, 查问工夫缩小了 50% 以上

因为某些 mysql 版本不反对在 in 子句中应用 limit, 所以这里多嵌套了一层子查问

select * from big_tables where id in (
    select id from (select id from big_tables limit 10000000, 1) as t
) limit 0, 1;

下面咱们晓得, 查问的耗费次要是在索引遍历的过程中, 如果 id 是间断递增的, 能够应用 WHERE 来优化

# query3
select * from big_tables where id >= 10000000 limit 0, 1;
# explain query3
explain select * from big_tables where id >= 10000000 limit 0, 1;

后果

mysql> # query3
select * from big_tables where id >= 10000000 limit 0, 1;
+----------+------------+-----+------------+
| id       | name       | age | data       |
+----------+------------+-----+------------+
| 10000000 | Hey8TWX966 |   7 | kSjxDkL1qj |
+----------+------------+-----+------------+
1 row in set (0.08 sec)

mysql> # explain query3
explain select * from big_tables where id >= 10000000 limit 0, 1;
+----+-------------+------------+------------+-------+---------------+---------+---------+------+-------+----------+-------------+
| id | select_type | table      | partitions | type  | possible_keys | key     | key_len | ref  | rows  | filtered | Extra       |
+----+-------------+------------+------------+-------+---------------+---------+---------+------+-------+----------+-------------+
|  1 | SIMPLE      | big_tables | NULL       | range | PRIMARY,id    | PRIMARY | 8       | NULL | 19298 |   100.00 | Using where |
+----+-------------+------------+------------+-------+---------------+---------+---------+------+-------+----------+-------------+
1 row in set (0.13 sec)

性能从原来的 5.86s 升高到了 0.08s, 晋升 73 倍.

因为 rows 扫描行数不再是千万级别, 而只有一页的大小

在 MySQL 中，新增的惟一索引须要查问 2 次，第一次是查问索引树，第二次是查问数据页，而主键索引是惟一索引，所以查问主键索引时，只须要查问一次索引树即可。

然而在测试中, 1000 万的偏移量的查问下, 再给主键加上惟一索引查问更快, 这是为什么呢?

参考:

https://dba.stackexchange.com/questions/290617/why-does-mysql…
https://stackoverflow.com/questions/75937219/why-add-unique-index-to-primary-index-is-faster

本文由 mdnice 多平台公布

MySQL 千万数据查问优化之路

先说论断

查问的偏移量优化

查问的数据量优化

筹备数据

建表

导入数据

查看导入的数据

一般查问

查问 offset 为 1w, 10w, 100w, 1000w 的 1 条数据

查问 offset 为 10, 100, 1000, 10000 条数据

优化: 偏移量导致的查问慢

1. 子查问

2. 子查问 EXPLAIN 剖析

给主键加上惟一索引

如果主键不是递增的, 比方是字符串, 须要用 IN 查问

如果主键是线性递增, 能够应用 WHERE 优化

为什么给主键加上惟一索引查问更快