共计 7194 个字符,预计需要花费 18 分钟才能阅读完成。
摘要: 如果能够将外连贯转换为内连贯,那么就能够简化查问优化过程。
本文分享自华为云社区《GaussDB(DWS) 外连贯向内连贯的转换》,作者:wangxiaojuan8。
在查问优化的过程中,内连贯的表之间的连贯程序能够随便替换,where 或 on 条件中只波及单表的条件能够下推到表上作为表的过滤条件;而对于外连贯来说,表的连贯程序不能随便替换,约束条件也不能随便的下推。如果能够将外连贯转换为内连贯,那么就能够简化查问优化过程。
外连贯可转为内连贯需满足的条件
为了形容不便,引入两个名词:
- 不空侧:外连贯中所有数据都被输入的一侧。比方:左外连贯的左表、右外连贯的右表
- 可空侧:外连贯中会被补空值的一侧。比方:左外连贯的右表、右外连贯的左表、全外连贯的左表和右表
只有满足以下条件之一,就能够将外连贯转换为内连贯:
- Where 条件中有“严格”的约束条件,且该约束条件中援用了可空侧的表中列。这样,该谓词便能够将可空侧产生的空值都过滤掉了,使得最终后果等同于内连贯。
参考《PostgreSQL 技术底细 - 查问优化深度摸索》一书中对“严格”的定义如下:
“严格”的准确定义是对于一个函数、操作符或者表达式,如果输出参数是 NULL 值,那么输入也肯定是 NULL 值,就能够说这个函数、操作符或者表达式是严格的;然而宽泛的说,对于函数、操作符或者表达式,如果输出参数是 NULL 值,输入后果是 NULL 值或者 FALSE,那么就认为这个函数或者操作符是严格的。如果在约束条件里有这种严格的操作符、函数或者表达式,因为输出是 NULL 值,输入是 NULL 或者 FALSE,那么对于含有 NULL 值的元组就会被过滤掉。
本文沿用 GuassDB(DWS) 的连贯类型介绍一文中的表和数据进行举例说明。
例 1:查问数学问题不为空的学生的数学问题
postgres=# select s.id, s.name, ms.score from student s left join math_score ms on (s.id = ms.id) where ms.score is not null;
id | name | score
----+-------+-------
1 | Tom | 80
2 | Lily | 75
4 | Perry | 95
(3 rows)
postgres=# explain select s.id, s.name, ms.score from student s left join math_score ms on (s.id = ms.id) where ms.score is not null;
QUERY PLAN
----------------------------------------------------------------------------
id | operation | E-rows | E-width | E-costs
----+----------------------------------------+--------+---------+---------
1 | -> Streaming (type: GATHER) | 30 | 126 | 36.59
2 | -> Hash Join (3,4) | 30 | 126 | 28.59
3 | -> Seq Scan on student s | 30 | 122 | 14.14
4 | -> Hash | 29 | 8 | 14.14
5 | -> Seq Scan on math_score ms | 30 | 8 | 14.14
Predicate Information (identified by plan id)
---------------------------------------------
2 --Hash Join (3,4)
Hash Cond: (s.id = ms.id)
5 --Seq Scan on math_score ms
Filter: (score IS NOT NULL)
(14 rows)
例 2:查问数学问题高于 80 的学生的数学问题
postgres=# select s.id, s.name, ms.score from student s left join math_score ms on (s.id = ms.id) where ms.score > 80;
id | name | score
----+-------+-------
4 | Perry | 95
(1 row)
postgres=# explain select s.id, s.name, ms.score from student s left join math_score ms on (s.id = ms.id) where ms.score > 80;
QUERY PLAN
----------------------------------------------------------------------------
id | operation | E-rows | E-width | E-costs
----+----------------------------------------+--------+---------+---------
1 | -> Streaming (type: GATHER) | 10 | 126 | 36.44
2 | -> Hash Join (3,4) | 10 | 126 | 28.44
3 | -> Seq Scan on student s | 30 | 122 | 14.14
4 | -> Hash | 10 | 8 | 14.18
5 | -> Seq Scan on math_score ms | 10 | 8 | 14.18
Predicate Information (identified by plan id)
---------------------------------------------
2 --Hash Join (3,4)
Hash Cond: (s.id = ms.id)
5 --Seq Scan on math_score ms
Filter: (score > 80)
(14 rows)
下面两个例子中,条件 where ms.score is not null 和 where ms.score > 80,如果输出的 score 为 NULL,则这个约束条件返回的是 false,满足了宽泛的“严格”定义。所以能够将外连贯打消,转换为内连贯。从下面的查问打算也失去了验证。而且这种外连贯打消是能够有数据库的查问优化器来主动解决的。
- On 连贯条件中,如果不空侧列中的值是可空侧列的子集,且可空侧的值都不为 NULL。典型的,不空侧的列为外键,可空侧的列为主键,且两者之间是主外键参考关系。
CREATE TABLE student(
id INTEGER primary key,
name varchar(50)
);
CREATE TABLE math_score(id INTEGER, -- 因为 GaussDB(DWS) 不反对外键,故此处省去了外键定义,但保障该列的值是 student 表中 id 列的子集
score INTEGER
);
INSERT INTO student VALUES(1, 'Tom');
INSERT INTO student VALUES(2, 'Lily');
INSERT INTO student VALUES(3, 'Tina');
INSERT INTO student VALUES(4, 'Perry');
INSERT INTO math_score VALUES(1, 80);
INSERT INTO math_score VALUES(2, 75);
INSERT INTO math_score VALUES(4, 95);
那么上面这个外连贯则和内连贯的后果雷同:
postgres=# select ms.id, s.name, ms.score from student s right join math_score ms on (s.id = ms.id);
id | name | score
----+-------+-------
1 | Tom | 80
2 | Lily | 75
4 | Perry | 95
(3 rows)
postgres=# explain select ms.id, s.name, ms.score from student s right join math_score ms on (s.id = ms.id);
QUERY PLAN
-------------------------------------------------------------------------
id | operation | E-rows | E-width | E-costs
----+-------------------------------------+--------+---------+---------
1 | -> Streaming (type: GATHER) | 30 | 126 | 36.59
2 | -> Hash Left Join (3, 4) | 30 | 126 | 28.59
3 | -> Seq Scan on math_score ms | 30 | 8 | 14.14
4 | -> Hash | 29 | 122 | 14.14
5 | -> Seq Scan on student s | 30 | 122 | 14.14
Predicate Information (identified by plan id)
---------------------------------------------
2 --Hash Left Join (3, 4)
Hash Cond: (ms.id = s.id)
(12 rows)
postgres=# select ms.id, s.name, ms.score from student s join math_score ms on (s.id = ms.id);
id | name | score
----+-------+-------
1 | Tom | 80
2 | Lily | 75
4 | Perry | 95
(3 rows)
postgres=# explain select ms.id, s.name, ms.score from student s join math_score ms on (s.id = ms.id);
QUERY PLAN
----------------------------------------------------------------------------
id | operation | E-rows | E-width | E-costs
----+----------------------------------------+--------+---------+---------
1 | -> Streaming (type: GATHER) | 30 | 126 | 36.59
2 | -> Hash Join (3,4) | 30 | 126 | 28.59
3 | -> Seq Scan on student s | 30 | 122 | 14.14
4 | -> Hash | 29 | 8 | 14.14
5 | -> Seq Scan on math_score ms | 30 | 8 | 14.14
Predicate Information (identified by plan id)
---------------------------------------------
2 --Hash Join (3,4)
Hash Cond: (s.id = ms.id)
(12 rows)
因为 GaussDB(DWS) 不反对外键,所以这种可将外连贯打消转换为内连贯的条件无奈被优化器辨认而进行主动转换,然而可帮忙开发者在日常编写 SQL 时加以人工辨认,手工打消外连贯。
一个乏味的改写示例
有如下的一个用例:
Select count(1)
from student s left join math_score ms on (s.id = ms.id)
where s.id = 2
and ms.score > 70;
postgres=# Select count(1)
postgres-# from student s left join math_score ms on (s.id = ms.id)
postgres-# where s.id = 2
postgres-# and ms.score > 70;
count
-------
0
(1 row)
postgres=# explain Select count(1)
postgres-# from student s left join math_score ms on (s.id = ms.id)
postgres-# where s.id = 2
postgres-# and ms.score > 70;
QUERY PLAN
-----------------------------------------------------------------------------------------------------
id | operation | E-rows | E-width | E-costs
----+-----------------------------------------------------------------+--------+---------+---------
1 | -> Aggregate | 1 | 8 | 26.51
2 | -> Streaming (type: GATHER) | 1 | 8 | 26.51
3 | -> Aggregate | 1 | 8 | 22.51
4 | -> Nested Loop (5,6) | 3 | 0 | 22.49
5 | -> Index Only Scan using student_pkey on student s | 1 | 4 | 8.27
6 | -> Seq Scan on math_score ms | 1 | 4 | 14.21
Predicate Information (identified by plan id)
-----------------------------------------------------
5 --Index Only Scan using student_pkey on student s
Index Cond: (id = 2)
6 --Seq Scan on math_score ms
Filter: ((score > 70) AND (id = 2))
(15 rows)
从下面的打算可见,sql 中的左外连贯曾经被优化为穿插连贯,因为在 4 号算子 Nest Loop 上没有 join 条件。
此查问最终是要计算两个表连贯后果的总行数。对于穿插连贯来说,两表连贯的行数等于左表行数和右表行数的乘积。所以此查问能够批改为如下等价查问:
explain select lcount * rcount as count
from (select count(1) lcount from student where id = 2) s,
(select count(1) rcount from math_score where score > 70 and id = 2) ms;
postgres=# select lcount * rcount as count
postgres-# from (select count(1) lcount from student where id = 2) s,
postgres-# (select count(1) rcount from math_score where score > 70 and id = 2) ms;
count
-------
1
(1 row)
postgres=# explain select lcount * rcount as count
postgres-# from (select count(1) lcount from student where id = 2) s,
postgres-# (select count(1) rcount from math_score where score > 70 and id = 2) ms;
QUERY PLAN
------------------------------------------------------------------------------------------------------
id | operation | E-rows | E-width | E-costs
----+------------------------------------------------------------------+--------+---------+---------
1 | -> Streaming (type: GATHER) | 1 | 16 | 26.56
2 | -> Nested Loop (3,7) | 1 | 16 | 22.56
3 | -> Aggregate | 1 | 8 | 8.29
4 | -> Streaming(type: BROADCAST) | 1 | 8 | 8.29
5 | -> Aggregate | 1 | 8 | 8.28
6 | -> Index Only Scan using student_pkey on student | 1 | 0 | 8.27
7 | -> Materialize | 1 | 8 | 14.25
8 | -> Aggregate | 1 | 8 | 14.23
9 | -> Streaming(type: BROADCAST) | 1 | 8 | 14.23
10 | -> Aggregate | 1 | 8 | 14.22
11 | -> Seq Scan on math_score | 1 | 0 | 14.21
Predicate Information (identified by plan id)
---------------------------------------------------
6 --Index Only Scan using student_pkey on student
Index Cond: (id = 2)
11 --Seq Scan on math_score
Filter: ((score > 70) AND (id = 2))
(20 rows)
通过这种改写,能够将汇集操作推到 Nested Loop 的每个子树中执行,当 Nested Loop 的每个子树的数据量比拟大时,汇集能够大大降低后果集,缩小参加 join 的数据量,从而进步性能。
上面例子留作思考:
Select sum(score)
From student s left join math_score ms on (s.id = ms.id)
Where s.id = 2
And ms.score > 70;
点击关注,第一工夫理解华为云陈腐技术~