乐趣区

关于数据库:庖丁解牛|图解-MySQL-80-优化器查询转换篇

简介:本篇介绍子查问、剖析表和 JOIN 的简单转换过程

一 背景和架构

在《庖丁解牛 - 图解 MySQL 8.0 优化器查问解析篇》一文中咱们重点介绍了 MySQL 最新版本 8.0.25 对于 SQL 根本元素表、列、函数、聚合、分组、排序等元素的解析、设置和转换过程,本篇咱们持续来介绍更为简单的子查问、分区表和 JOIN 的简单转换过程,纲要如下:

Transformation

remove_redundant_subquery_clause :Permanently remove redundant parts from the query if 1) This is a subquery 2) Not normalizing a view. Removal should take place when a query involving a view is optimized, not when the view is created.

remove_base_options:Remove SELECT_DISTINCT options from a query block if can skip distinct

resolve_subquery :Resolve predicate involving subquery, perform early unconditional subquery transformations

Convert subquery predicate into semi-join, or
Mark the subquery for execution using materialization, or
Perform IN->EXISTS transformation, or
Perform more/less ALL/ANY -> MIN/MAX rewrite
Substitute trivial scalar-context subquery with its value

transform_scalar_subqueries_to_join_with_derived:Transform eligible scalar subqueries to derived tables.

flatten_subqueries :Convert semi-join subquery predicates into semi-join join nests. Convert candidate subquery predicates into semi-join join nests. This transformation is performed once in query lifetime and is irreversible.

apply_local_transforms :

delete_unused_merged_columns : If query block contains one or more merged derived tables/views, walk through lists of columns in select lists and remove unused columns.

simplify_joins : Convert all outer joins to inner joins if possible.

prune_partitions:Perform partition pruning for a given table and condition.

push_conditions_to_derived_tables :Pushing conditions down to derived tables must be done after validity checks of grouped queries done by apply_local_transforms();

Window::eliminate_unused_objects:Eliminate unused window definitions, redundant sorts etc.

二 具体转换过程

1 解析子查问(resolve_subquery)

解析条件中带有子查问的语句,做一些晚期的无限度的子查问转换,包含:

标记 subquery 是否变成 semi-join

转换判断条件

  • 查看 OPTIMIZER_SWITCH_SEMIJOIN 和 HINT 没有限度
  • 子查问是 IN/=ANY 和 EXIST subquery 的谓词
  • 子查问是简略查问块而不是 UNION
  • 子查问无隐形和显性的 GROUP BY
  • 子查问没有 HAVING、WINDOW 函数
  • Resolve 的阶段是 Query_block::RESOLVE_CONDITION 和 Query_block::RESOLVE_JOIN_NEST 并且没有用到最新的 Hyper optimizer 优化器。
  • 外查问块能够反对 semijoins
  • 至多要一个表,而不是相似 ”SELECT 1″
  • 子查问的策略还没有指定 Subquery_strategy::UNSPECIFIED
  • 父查问也至多有一个表
  • 父查问和子查问都不能有 straight join
  • 父查问块不禁止 semijoin
  • IN 谓词返回值是否是确定的,不是 RAND
  • 依据子查问判断后果是否须要转成 true 还是 false 以及是否为 NULL,判断是能够做 antijoin 还是 semijoin
  • Antijoin 是能够反对的,或者是 semijoin
  • offset 和 limit 对于 semjoin 是无效的,offset 是从第一行开始,limit 也不是 0

设置 Subquery_strategy::CANDIDATE_FOR_SEMIJOIN 并增加 sj_candidates

标记 subquery 是否执行时采纳 materialization 计划

  • 如果不合乎转换 semijoin,尝试应用物化形式,转换判断条件
  • Optimzier 开关 subquery_to_derived=on
  • 子查问是 IN/=ANY or EXISTS 谓词
  • 子查问是简略查问块而不是 UNION
  • 如果是[NOT] EXISTS,必须没有聚合
  • Subquery 谓词在 WHERE 子句(目前没有在 ON 子句实现),而且是 ANDs or ORs 的表达式 tree
  • 父查问块反对 semijoins
  • 子查问的策略还没有指定 Subquery_strategy::UNSPECIFIED
  • 父查问也至多有一个表,而后能够做 LEFT JOIN
  • 父查问块不禁止 semijoin
  • IN 谓词返回值是否是确定的,不是 RAND
  • 依据子查问判断后果是否须要转成 true 还是 false 以及是否为 NULL,判断是能够做 antijoin 还是 semijoin
  • 不反对右边参数不是 multi-column 子查问(WHERE (outer_subq) = ROW(derived.col1,derived.col2))
  • 该子查问不反对转换为 Derived table(m_subquery_to_derived_is_impossible)
  • 设置 Subquery_strategy::CANDIDATE_FOR_DERIVED_TABLE 并增加 sj_candidates

如果下面两个策略无奈应用,依据类型抉择 transformer

Item_singlerow_subselect::select_transformer
对于简略的标量子查问,在查问中间接用执行后果代替

select * from t1 where a = (select 1); 
=>
select * from t1 where a = 1;

Item_in_subselect/Item_allany_subselect::select_transformer->select_in_like_transformer

select_in_like_transformer 函数来解决 IN/ALL/ANY/SOME 子查问转换 transformation

解决 ”SELECT 1″(Item_in_optimizer)

如果目前还没有子查问的执行形式,也就是无奈应用 semijoin/antijoin 执行的子查问,会做 IN->EXISTS 的转换,实质是在物化执行和迭代式循环执行中做抉择。IN 语法代表非相干子查问仅执行一次,将查问后果物化成长期表,之后须要后果时候就去物化表中查找;EXISTS 代表对于表面的每一条记录,子查问都会执行一次,是迭代式循环执行。子查问策略设定为 Subquery_strategy::CANDIDATE_FOR_IN2EXISTS_OR_MAT

重写 single-column 的 IN/ALL/ANY 子查问(single_value_transformer)

oe $cmp$ (SELECT ie FROM ... WHERE subq_where ... HAVING subq_having)
=>
- oe $cmp$ (SELECT MAX(...) )  // handled by Item_singlerow_subselect
- oe $cmp$ \<max\>(SELECT ...)   // handled by Item_maxmin_subselect
​
fails=>Item_in_optimizer
- 对于曾经是 materialized 计划,不转换
- 通过 equi-join 转换 IN 到 EXISTS

如果是 ALL/ANY 单值 subquery 谓词,尝试用 MIN/MAX 子查问转换

SELECT * FROM t1 WHERE a < ANY (SELECT a FROM t1); 
=>
SELECT * FROM t1 WHERE a < (SELECT MAX(a) FROM t1)

不满足下面,调用 single_value_in_to_exists_transformer 转换 IN 到 EXISTS

转换将要将子查问设置为相干子查问,设置 UNCACHEABLE_DEPENDENT 标识

如果子查问蕴含聚合函数、窗口函数、GROUP 语法、HAVING 语法,将判断条件退出到 HAVING 子句中,另外通过 ref_or_null_helper 来辨别 NULL 和 False 的后果,如须要解决 NULL IN (SELECT …)还须要封装到 Item_func_trig_cond 触发器中。

SELECT ... FROM t1 WHERE t1.b IN (SELECT <expr of SUM(t1.a)> FROM t2)
=>
SELECT ... FROM t1 WHERE t1.b IN (SELECT <expr of SUM(t1.a)> FROM t2
                                 [trigcond] HAVING t1.b=ref-to-<expr of SUM(t1.a)>)

如果子查问不蕴含聚合函数、窗口函数、GROUP 语法,会放在 WHERE 查问条件中,当然如果须要解决 NULL 状况还是要放入 HAVING 子句(Item_func_trig_cond+Item_is_not_null_test)。

不须要辨别 NULL 和 FALSE 的子查问:
​
SELECT 1 FROM ... WHERE (oe $cmp$ ie) AND subq_where
​
须要辨别的子查问:
SELECT 1 FROM ...
      WHERE subq_where AND trigcond((oe $cmp$ ie) OR (ie IS NULL))
      HAVING trigcond(@<is_not_null_test@>(ie))


JOIN::optimize()会计算 materialization 和 EXISTS 转换的代价进行抉择,设置 m_subquery_to_derived_is_impossible = true

ROW 值转换,通过 Item_in_optimizer,不反对 ALL/ANY/SOME(row_value_transformer)

Item_in_subselect::row_value_in_to_exists_transformer

for (each left operand)
  create the equi-join condition
  if (is_having_used || !abort_on_null)
    create the "is null" and is_not_null_test items
  if (is_having_used)
    add the equi-join and the null tests to HAVING
  else
    add the equi-join and the "is null" to WHERE
    add the is_not_null_test to HAVING

没有 HAVING 表达式

(l1, l2, l3) IN (SELECT v1, v2, v3 ... WHERE where) =>
EXISTS (SELECT ... WHERE where and
                         (l1 = v1 or is null v1) and
                         (l2 = v2 or is null v2) and
                         (l3 = v3 or is null v3)
                   [HAVING is_not_null_test(v1) and
                         is_not_null_test(v2) and
                         is_not_null_test(v3)) ] <-- 保障不为 NULL 能够去掉 HAVING

有 HAVING 表达式

(l1, l2, l3) IN (SELECT v1, v2, v3 ... HAVING having) =>
EXISTS (SELECT ... HAVING having and
                          (l1 = v1 or is null v1) and
                          (l2 = v2 or is null v2) and
                          (l3 = v3 or is null v3) and
                          is_not_null_test(v1) and
                          is_not_null_test(v2) and
                          is_not_null_test(v3))

2 转换的标量子查问转换成 Derived Table(transform_scalar_subqueries_to_join_with_derived)

该个性是官网在 8.0.16 中为了更好的反对 Secondary Engine(Heapwave)的剖析下推,加强了子查问的转换能力。能够先直观的看下转换和不转换的执行打算的不同:

root:test> set optimizer_switch = 'subquery_to_derived=off';
Query OK, 0 rows affected (0.00 sec)
​
root:test> EXPLAIN SELECT b, MAX(a) AS ma FROM t4 GROUP BY b HAVING ma < (SELECT MAX(t2.a) FROM t2 WHERE t2.b=t4.b);
+----+--------------------+-------+------------+------+---------------+------+---------+------+------+----------+-----------------+
| id | select_type        | table | partitions | type | possible_keys | key  | key_len | ref  | rows | filtered | Extra           |
+----+--------------------+-------+------------+------+---------------+------+---------+------+------+----------+-----------------+
|  1 | PRIMARY            | t4    | NULL       | ALL  | NULL          | NULL | NULL    | NULL |   10 |   100.00 | Using temporary |
|  2 | DEPENDENT SUBQUERY | t2    | NULL       | ALL  | NULL          | NULL | NULL    | NULL |    3 |    33.33 | Using where     |
+----+--------------------+-------+------------+------+---------------+------+---------+------+------+----------+-----------------+
2 rows in set, 3 warnings (0.00 sec)
​
root:test> set optimizer_switch = 'subquery_to_derived=on';
Query OK, 0 rows affected (0.00 sec)
​
root:test> EXPLAIN SELECT b, MAX(a) AS ma FROM t4 GROUP BY b HAVING ma < (SELECT MAX(t2.a) FROM t2 WHERE t2.b=t4.b);
+----+-------------+------------+------------+------+---------------+------+---------+------+------+----------+--------------------------------------------+
| id | select_type | table      | partitions | type | possible_keys | key  | key_len | ref  | rows | filtered | Extra                                      |
+----+-------------+------------+------------+------+---------------+------+---------+------+------+----------+--------------------------------------------+
|  1 | PRIMARY     | t4         | NULL       | ALL  | NULL          | NULL | NULL    | NULL |   10 |   100.00 | Using temporary                            |
|  1 | PRIMARY     | <derived2> | NULL       | ALL  | NULL          | NULL | NULL    | NULL |    3 |   100.00 | Using where; Using join buffer (hash join) |
|  2 | DERIVED     | t2         | NULL       | ALL  | NULL          | NULL | NULL    | NULL |    3 |   100.00 | Using temporary                            |
+----+-------------+------------+------------+------+---------------+------+---------+------+------+----------+--------------------------------------------+
3 rows in set, 3 warnings (0.01 sec)

transform_scalar_subqueries_to_join_with_derived 具体转换的过程如下:

首先从 JOIN 条件、WHERE 条件、HAVING 条件和 SELECT list 中收集能够转换的标量子查问(Item::collect_scalar_subqueries)。

遍历这些子查问,判断是否能够减少一个额定的转换(transform_grouped_to_derived):把隐性的 GROUP BY 标量子查问变成 Derived Table。

SELECT SUM(c1), (SELECT SUM(c1) FROM t3) scalar FROM t1;
转换为 =>
SELECT derived0.summ, derived1.scalar
FROM (SELECT SUM(a) AS summ FROM t1) AS derived0
       LEFT JOIN
       (SELECT SUM(b) AS scalar FROM t3) AS derived1
       ON TRUE
执行打算如下:explain SELECT SUM(a), (SELECT SUM(c1) FROM t3) scalar FROM t1;
+----+-------------+------------+------------+------+---------------+------+---------+------+------+----------+--------------------------------------------+
| id | select_type | table      | partitions | type | possible_keys | key  | key_len | ref  | rows | filtered | Extra                                      |
+----+-------------+------------+------------+------+---------------+------+---------+------+------+----------+--------------------------------------------+
|  1 | PRIMARY     | <derived3> | NULL       | ALL  | NULL          | NULL | NULL    | NULL |    1 |   100.00 | NULL                                       |
|  1 | PRIMARY     | <derived4> | NULL       | ALL  | NULL          | NULL | NULL    | NULL |    1 |   100.00 | Using where; Using join buffer (hash join) |
|  4 | DERIVED     | t3         | NULL       | ALL  | NULL          | NULL | NULL    | NULL |    1 |   100.00 | NULL                                       |
|  3 | DERIVED     | t1         | NULL       | ALL  | NULL          | NULL | NULL    | NULL |    2 |   100.00 | NULL                                       |
+----+-------------+------------+------------+------+---------------+------+---------+------+------+----------+--------------------------------------------+

收集惟一的聚合函数 Item 列表(collect_aggregates),这些 Item 将会被新的 Derived Table 的列代替。

还须要增加所有援用到这些 Item 的 fields,包含间接在 SELECT 列表的,Window 函数参数、ORDER by、Partition by 蕴含的,还有该查问块中 ORDER BY 的列,因为他们都会引动到 Derived Table 里。

创立 Derived Table 须要的 Query_expression/Query_block(create_query_expr_and_block)。

增加 Derived Table 到查问块和 top_join_list 中。

保留旧的子查问单元块,如果蕴含能够转化的 Derived 的移到 Derived Table 上面的 Query_block,如果不蕴含,保留到原来的子查问块中。

将之前的聚合函数 Item 列表插入到 Derived Table 的查问块中。

收集除 GROUP AGG 表达式中的列,因为这些 fields 曾经挪动到 Derived Table 中,删除不合理的 fields 援用。

收集所有惟一的列和 View 的援用后,将他们加到新的 Derived Table 列表中。

对新的新的 Derived Table 进行 flatten_subqueries/setup_tables

从新 resolve_placeholder_tables,不解决进行转换后的子查问。

解决 Derived Table 中,新退出的 HAVING 条件中的聚合函数 Item,并通过 Item_aggregate_refs 援用到 new_derived->base_ref_items 而不是之前的父查问块 base_ref_items。

永恒代替父查问块中的聚合函数列表,变成 Derived Table 的列,并删除他们。

之前保留和退出到 Derived Table 的惟一的列和 View 的援用,也要替换新的 fields 代替他们的援用。

但目前不反对 HAVING 表达式中蕴含该子查问,其实也是能够转换的。

SELECT SUM(a), (SELECT SUM(b) FROM t3) scalar
FROM t1
HAVING SUM(a) > scalar;
转换为 =>
SELECT derived0.summ, derived1.scalar
FROM (SELECT SUM(a) AS summ FROM t1) AS derived0
       LEFT JOIN
       (SELECT SUM(b) AS scalar FROM t3) AS derived1
       ON TRUE
WHERE derived0.sum > derived1.scalar;

接下来遍历所有能够转换的子查问,把他们转换成 derived tables,并替换相应的表达式变成列(transform_subquery_to_derived)。

  • 生成 derived table 的 TABLE_LIST(synthesize_derived)。
  • 将能够挪动到 derived table 的 where_cond 设置到 join_cond 上。
  • 增加 derived table 到查问块的表汇合中。
  • decorrelate_derived_scalar_subquery_pre
  • 增加非相干援用列(NCF)到 SELECT list,这些条件被 JOIN 条件所援用,并且还有另外一个 fields 蕴含了外查问相干的列,咱们称之为 ’lifted_where’
  • 增加 COUNT(*)到 SELECT list,这样转换的查问块能够进行 cardinality 的查看。比方没有任何聚合函数在子查问中。如果确定蕴含聚合函数,返回一行肯定是 NCF 同时在 GROUP BY 列表中。
  • 增加 NCF 到子查问的 GROUP 列表中,如果曾经在了,须要加到最初,如果产生 GROUP BY 的列因为依赖性查看失败,还要加 Item_func_any_value(非聚合列)到 SELECT list。对于 NCF 会创立 derived.field 和 derived.count(field)
  • 设置物化的一些筹备(setup_materialized_derived)。
  • decorrelate_derived_scalar_subquery_post:
  • 创立对应的 ’lifted_fields’。
  • 更新 JOIN 条件中相干列的援用,不在援用外查问而换成 Derived table 相干的列。

代替 WHERE、JOIN、HAVING 条件和 SELECT list 中的子查问的表达式变成对应的 Derived Table 外面列。

上面图解该函数的转换过程和后果:


3 扁平化子查问(flatten_subqueries)

该函数次要是将 Semi-join 子查问转换为 nested JOIN,这个过程只有一次,并且不可逆。

简略来讲步骤能够简化了解为:

  • 创立 SEMI JOIN (it1 … itN)语以局部,并退出到外层查问块的执行打算中。
  • 将子查问的 WHERE 条件以及 JOIN 条件,退出到父查问的 WHERE 条件中。
  • 将子查问谓词从父查问的判断谓词中打消。

因为 MySQL 在一个 query block 中可能 join 的 tables 数是无限的(MAX_TABLES),不是所有 sj_candidates 都能够做因而做 flatten_subqueries 的,因而须要有优先级决定的先后顺序先 unnesting 掉,优先级规定如下:

  • 相干子查问优先于非相干的
  • inner tables 多的子查问大于 inner tables 少的
  • 地位前的子查问大于地位后的
subq_item->sj_convert_priority =
        (((dependent * MAX_TABLES_FOR_SIZE) +  // dependent subqueries first
          child_query_block->leaf_table_count) *
         65536) +           // then with many tables
        (65536 - subq_no);  // then based on position

另外,因为递归调用 flatten_subqueries 是 bottom-up,顺次把上层的子查问开展到外层查问块中。

 for SELECT#1 WHERE X IN (SELECT #2 WHERE Y IN (SELECT#3)) :
​
 Query_block::prepare() (select#1)
    -> fix_fields() on IN condition
        -> Query_block::prepare() on subquery (select#2)
            -> fix_fields() on IN condition
                 -> Query_block::prepare() on subquery (select#3)
                 <- Query_block::prepare()
            <- fix_fields()
            -> flatten_subqueries: merge #3 in #2
            <- flatten_subqueries
        <- Query_block::prepare()
    <- fix_fields()
    -> flatten_subqueries: merge #2 in #1

遍历子查问列表,删除 Item::clean_up_after_removal 标记为 Subquery_strategy::DELETED 的子查问,并且依据优先级规定设置 sj_convert_priority。依据优先级进行排序。

遍历排序后的子查问列表,对于 Subquery_strategy::CANDIDATE_FOR_DERIVED_TABLE 策略的子查问,转换子查问 ([NOT] {IN, EXISTS}) 为 JOIN 的 Derived table(transform_table_subquery_to_join_with_derived)

FROM [tables] WHERE ... AND/OR oe IN (SELECT ie FROM it) ...
=>
FROM (tables) LEFT JOIN (SELECT DISTINCT ie FROM it) AS derived
              ON oe = derived.ie WHERE ... AND/OR derived.ie IS NOT NULL ...

设置策略为 Subquery_strategy::DERIVED_TABLE

  • semijoin 子查问不能和 antijoin 子查问互相嵌套,或者外查问表曾经超过 MAX_TABLE,不做转换,否则标记为 Subquery_strategy::SEMIJOIN 策略。
  • 判断子查问的 WHERE 条件是否为常量。如果判断条件永远为 FALSE,那么子查问后果永远为空。该状况下,调用 Item::clean_up_after_removal 标记为 Subquery_strategy::DELETED,删除该子查问。
  • 如果无奈标记为 Subquery_strategy::DELETED/ 设置 Subquery_strategy::SEMIJOIN 策略的从新标记会 Subquery_strategy::UNSPECIFIED 持续下一个。
  • 替换外层查问的 WHERE 条件中子查问判断的条件(replace_subcondition)
  • 子查问内条件并不永远为 FALSE,或者永远为 FALSE 的状况下,须要改写为 antijoin(antijoin 状况下,子查问后果永远为空,外层查问条件永远通过)。此时将条件改为永远为 True。
  • 子查问永远为 FALSE,且不是 antijoin。那么将外层查问中的条件改成永远为 False。
  • Item_subselect::EXISTS_SUBS 不反对有聚合操作
  • convert_subquery_to_semijoin 函数解析如下模式的 SQL
  • IN/=ANY 谓词
  • 如果条件满足解关联,解关联 decorrelate_condition
  • 增加解关联的内表表达式到 SELECT list
  • 收集 FROM 子句中的表面相干的 derived table 或 join 条件
  • 去掉关联标识 UNCACHEABLE_DEPENDENT,更新 used table
  • Derived table 子查问减少 SELECT_DISTINCT 标识
  • 转换子查问成为一个 derived table,并且插入到所属于的查问块 FROM 后(transform_subquery_to_derived)
  • 创立 derived table 及其 join 条件
  • 遍历父查问块的 WHERE,替换该子查问的 Item 代替成 derived table(replace_subcondition)
  • 遍历排序后的子查问列表,对于 Subquery_strategy::CANDIDATE_FOR_SEMIJOIN 策略的子查问。
  • 判断是否能够转换为 semijoin
  • 遍历排序后的子查问列表,对于 Subquery_strategy::SEMIJOIN 的子查问,开始转换为 semijoin/antijoin(convert_subquery_to_semijoin)
  • convert_subquery_to_semijoin 函数解析如下模式的 SQL

IN/=ANY 谓词

SELECT ...
  FROM ot1 ... otN
  WHERE (oe1, ... oeM) IN (SELECT ie1, ..., ieM
                           FROM it1 ... itK
                          [WHERE inner-cond])
   [AND outer-cond]
  [GROUP BY ...] [HAVING ...] [ORDER BY ...]
=>
  SELECT ...
  FROM (ot1 ... otN) SJ (it1 ... itK)
                     ON (oe1, ... oeM) = (ie1, ..., ieM)
                        [AND inner-cond]
  [WHERE outer-cond]
  [GROUP BY ...] [HAVING ...] [ORDER BY ...]

EXISTS 谓词

SELECT ...
  FROM ot1 ... otN
  WHERE EXISTS (SELECT expressions
                FROM it1 ... itK
                [WHERE inner-cond])
   [AND outer-cond]
  [GROUP BY ...] [HAVING ...] [ORDER BY ...]
=>
  SELECT ...
  FROM (ot1 ... otN) SJ (it1 ... itK)
                     [ON inner-cond]
  [WHERE outer-cond]
  [GROUP BY ...] [HAVING ...] [ORDER BY ...]

NOT EXISTS 谓词

SELECT ...
  FROM ot1 ... otN
  WHERE NOT EXISTS (SELECT expressions
                FROM it1 ... itK
                [WHERE inner-cond])
   [AND outer-cond]
  [GROUP BY ...] [HAVING ...] [ORDER BY ...]
=>
  SELECT ...
  FROM (ot1 ... otN) AJ (it1 ... itK)
                     [ON inner-cond]
  [WHERE outer-cond AND is-null-cond(it1)]
  [GROUP BY ...] [HAVING ...] [ORDER BY ...]

NOT IN 谓词

SELECT ...
  FROM ot1 ... otN
  WHERE (oe1, ... oeM) NOT IN (SELECT ie1, ..., ieM
                               FROM it1 ... itK
                               [WHERE inner-cond])
   [AND outer-cond]
  [GROUP BY ...] [HAVING ...] [ORDER BY ...]
=>
  SELECT ...
  FROM (ot1 ... otN) AJ (it1 ... itK)
                     ON (oe1, ... oeM) = (ie1, ..., ieM)
                        [AND inner-cond]
  [WHERE outer-cond]
  [GROUP BY ...] [HAVING ...] [ORDER BY ...]
  • 查找能够插入 semi-join 嵌套和其生成的条件的地位,比方对于 t1 LEFT JOIN t2, embedding_join_nest 为 t2,t2 也能够是 nested,如 t1 LEFT JOIN (t2 JOIN t3))
  • 生成一个新的 semijoin 嵌套的 TABLE_LIST 表
  • 解决 Antijoin
  • 将子查问中潜在的表合并到上述 join 表(TABLE_LIST::merge_underlying_tables)
  • 将子查问的叶子表插入到以后查问块的叶子表前面,从新设置子查问的叶子表的序号和依赖的表面。将子查问的叶子表重置。
  • 如果是 outer join 的话,在 join 链表中传递可空性(propagate_nullability)
  • 将内层子查问中的关联条件去关联化,这些条件被退出到 semijoin 的列表里。这些条件必须是确定的,仅反对简略判断条件或者由简略判断条件组成的 AND 条件(Query_block::decorrelate_condition)
  • 判断左右条件是否仅依赖于内外层表,将其表达式别离退出到 semijoin 内表面的表达式列表中(decorrelate_equality)
  • 解关联内层查问的 join 条件(Query_block::decorrelate_condition)
  • 移除该子查问表达式在父查问的 AST(Query_express::exclude_level)
  • 依据 semi-join 嵌套产生的 WHERE/JOIN 条件更新对应的 table bitmap(Query_block::fix_tables_after_pullout)
  • 将子查问的 WHERE 条件上拉,更新应用表的信息(Item_cond_and::fix_after_pullout())
  • 依据 semijoin 的条件列表创立 AND 条件,如果有条件为常量 True,则去除该条件;如果常量为 False,则整个条件都去除(Query_block::build_sj_cond)
  • 将创立进去的 semijoin 条件退出到外层查问的 WHERE 条件中
  • 最初遍历排序后的子查问列表,对于没有转换的子查问,对于 Subquery_strategy::UNSPECIFIED 的策略,执行 IN->EXISTS 改写(select_transformer),如果的确原有的子查问曾经有代替的 Item,调用 replace_subcondition 解析并把他们退出到适合的 WHERE 或者 ON 子句。
  • 革除所有的 sj_candidates 列表
  • Semi-join 有 5 中执行形式,本文并不介绍 Optimizer 和 Execution 过程,具体能够参考援用文章中对于 semijoin 的介绍,最初引入下管制 semijoin 优化和执行的优化器开关,其中 semijoin=on/off 是总开关。
SELECT @@optimizer_switch\G
*************************** 1. row ***************************
@@optimizer_switch: ......
                    materialization=on,semijoin=on,loosescan=on,
                    firstmatch=on,
                    subquery_materialization_cost_based=on,
                    ......

下图举例说明该转换过程:

SELECT * FROM t1 WHERE t1.a in (SELECT t2.c1 FROM t2 where t2.c1 > 0);
=>
/* select#1 */
SELECT `t1`.`a` AS `a`
FROM `t1`
SEMI JOIN (`t2`)
WHERE ((`t1`.`a` = `t2`.`c1`) and (`t2`.`c1` > 0))
执行打算如下:explain SELECT * FROM t1 WHERE t1.a in (SELECT t2.c1 FROM t2 where t2.c1 > 0);
+----+-------------+-------+------------+------+---------------+------+---------+------+------+----------+-----------------------------------------------------------+
| id | select_type | table | partitions | type | possible_keys | key  | key_len | ref  | rows | filtered | Extra                                                     |
+----+-------------+-------+------------+------+---------------+------+---------+------+------+----------+-----------------------------------------------------------+
|  1 | SIMPLE      | t2    | NULL       | ALL  | NULL          | NULL | NULL    | NULL |    1 |   100.00 | Using where; Start temporary                              |
|  1 | SIMPLE      | t1    | NULL       | ALL  | NULL          | NULL | NULL    | NULL |    2 |    50.00 | Using where; End temporary; Using join buffer (hash join) |
+----+-------------+-------+------------+------+---------------+------+---------+------+------+----------+-----------------------------------------------------------+

4 利用以后查问块转换(apply_local_transforms)

该函数在 flattern subqueries 之后,bottom-up 调用,次要分几个步骤:

删除无用列(delete_unused_merged_columns)

如果查问块曾经删除了一些 derived tables/views,遍历 SELECT 列表的列,删除不必要的列

简化 JOIN(simplify_joins)

该函数会把 Query_block 中的 top_join_list 的嵌套 join 的简化为扁平化的 join list。嵌套连贯包含 table1 join table2,也蕴含 table1, (table2, table3)这种模式。如果所示的简化过程:

分区表的动态剪枝(prune_partitions)

因为剪枝依据 HASH/RANGE/LIST 及二级分区都有不同,这里简略介绍下剪枝过程,现有 prune_partitions 是在 prepare 和 optimize 阶段会被调用,某些常量子查问被评估执行完。

struct TABLE {
   ......    
   partition_info *part_info{nullptr}; /* Partition related information */
   /* If true, all partitions have been pruned away */
   bool all_partitions_pruned_away{false};
   ......
}
    
SQL tranformation phase
SELECT_LEX::apply_local_transforms
--> prune_partitions
​
for example, select * from employee where company_id = 1000 ;
​
SQL optimizer phase
JOIN::prune_table_partitions
--> prune_partitions 
------>  based on tbl->join_cond_optim() or JOIN::where_cond
​
for example, explain select * from employee where company_id = (select c1 from t1);

举例上面 RANGE 剪枝的过程:

root:ref> CREATE TABLE R2  (
    ->     a INT,
    ->     d INT
    -> ) PARTITION BY RANGE(a) (->         PARTITION p20 VALUES LESS THAN (20),
    ->         PARTITION p40 VALUES LESS THAN (40),
    ->         PARTITION p60 VALUES LESS THAN (60),
    ->         PARTITION p80 VALUES LESS THAN (80),
    ->         PARTITION p100 VALUES LESS THAN MAXVALUE
    -> );
Query OK, 0 rows affected (0.09 sec)
​
root:ref> Select * From R2 where a > 40 and a < 80;

剪枝具体过程如下:

因为剪枝须要依据不同条件产生的 pruning 后果进行交加,因而剪枝过程中须要应用 read_partitions 这样的 bitmap 来保留是否应用该对应分区。另外剪枝过程相似迭代判断,因而引入了 part_iterator 来保留开始、完结和以后,以及对应须要获取区间范畴的 endpoint 函数和获取下一个值 next 的迭代器函数。这里奇妙的使用了指针,来兼容不同分区类型 Hash/Range/List 类型,如下图所示:


获取 join_cond 或者 m_where_cond 的 SEL_TREE 红黑树(get_mm_tree)

  • 调用 find_used_partitions 来获取满足的分区,对于 SEL_TREE 的每个区间(interval):1. 获取区间的左右端点 2. 从右边持续获取下一个满足的分区,直到到左边端点完结,每次调用完满足条件的分区须要应用 bitmap_set_bit 设置该分区在 part_info->read_partitions 上的位点。
  • find_used_partitions 是依据 SEL_TREE 的构造进行递归,如图从左到右遍历 next_key_part(and condition),而后再遍历 SEL_TREE 的左右(也就是高低方向,or condition)深度递归。
 (start)
   |                           $
   |   Partitioning keyparts   $  subpartitioning keyparts
   |                           $
   |     ...          ...      $
   |      |            |       $
   | +---------+  +---------+  $  +-----------+  +-----------+
   \-| par1=c1 |--| par2=c2 |-----| subpar1=c3|--| subpar2=c5|
     +---------+  +---------+  $  +-----------+  +-----------+
          |                    $        |             |
          |                    $        |        +-----------+
          |                    $        |        | subpar2=c6|
          |                    $        |        +-----------+
          |                    $        |
          |                    $  +-----------+  +-----------+
          |                    $  | subpar1=c4|--| subpar2=c8|
          |                    $  +-----------+  +-----------+
          |                    $
          |                    $
     +---------+               $  +------------+  +------------+
     | par1=c2 |------------------| subpar1=c10|--| subpar2=c12|
     +---------+               $  +------------+  +------------+
          |                    $
         ...                   $
​
例如第一行(par1=c1 and par2=c2 and subpar1=c3 and subpar2=c5)的遍历的 stack 将是:in find_used_partitions(key_tree = "subpar2=c5") (***)
in find_used_partitions(key_tree = "subpar1=c3")
in find_used_partitions(key_tree = "par2=c2")   (**)
in find_used_partitions(key_tree = "par1=c1")
in prune_partitions(...)
而后是持续上面的条件,以此类推
or(par1=c1 and par2=c2 and subpar1=c3 and subpar2=c6)or(par1=c1 and par2=c2 and subpar1=c4 and subpar2=c8)or(par1=c2 and subpar1=c10 and subpar2=c12)

下图来展现了 pruning 的构造和过程:

5 下推条件到 Derived Table(push_conditions_to_derived_tables)

该函数将条件下推到 derived tables,具体见 WL#8084 – Condition pushdown to materialized derived table。

root:test> set optimizer_switch = 'derived_merge=off'; // 敞开 dervied_merge 测试下推能力
Query OK, 0 rows affected (0.00 sec)
​
root:test> EXPLAIN FORMAT=tree SELECT * FROM (SELECT c1,c2 FROM t1) as dt     WHERE c1 > 10;
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| EXPLAIN                                                                                                                                                                                             |
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| -> Table scan on dt  (cost=2.51..2.51 rows=1)
    -> Materialize  (cost=2.96..2.96 rows=1)
        -> Filter: (t1.c1 > 10)  (cost=0.35 rows=1)
            -> Table scan on t1  (cost=0.35 rows=1)
 |
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

过程如下:

遍历 derived table 列表,判断是否能够下推(can_push_condition_to_derived),如果包含上面的状况则不能下推:

  • Derived table 有 UNION
  • Derived table 有 LIMIT
  • Derived table 不能是 outer join 中的内表,会导致更多 NULL 弥补的行
    不能是 CTE 蕴含的 Derived table
  • 创立能够下推到的 Derived table 的 where cond(Condition_pushdown::make_cond_for_derived)
  • 保留残余不能下推的条件(Condition_pushdown::get_remainder_cond)
  • Top-down 递归调用 push_conditions_to_derived_tables

具体图解该过程如下:

三 综述

两篇文章重点介绍了下优化器的基于规定的优化局部,并没有波及更多的基于代价的优化,能够看到对于间接使用规定优化带来执行的减速,那么能够间接转换,尤其是对于查问构造下面的变动类转换,如 merge_derived。对于使用规定优化无奈判断是否带来执行的减速,那么优化器会保留一些长期构造,为后续的代价估算提供更多抉择,如 IN/EXIST/Materialized 转换。当然还有一些,又扭转查问构造又无奈断定是否规定转换带来的执行减速,MySQL 目前还不反对。文章尽管详尽,但无奈笼罩全副状况,也是为了抛砖引玉,还须要读者本人通过调试的办法更进一步理解某一类 SQL 的具体过程。

四 参考资料

《MySQL 8.0 Server 层最新架构详解》

《WL#13520: Transform correlated scalar subqueries》

《WL#8084 – Condition pushdown to materialized derived table》

《WL#2980: Subquery optimization: Semijoin》

WL#3740: Subquery optimization: Semijoin: Pull-out of inner tables
WL#3741: Subquery optimization: Semijoin: Duplicate elimination strategy
WL#3750: Subquery optimization: Semijoin: First-match strategy
WL#3751: Subquery optimization: Semijoin: Inside-out strategy

《WL#4389: Subquery optimizations: Make IN optimizations also handle EXISTS》

《WL#4245: Subquery optimization: Transform NOT EXISTS and NOT IN to anti-join》

《WL#2985: Perform Partition Pruning of Range conditions》
《MySQL · 源码剖析 · Semi-join 优化执行代码剖析》
《MySQL·源码剖析·子查问优化源码剖析》《Optimizing Subqueries, Derived Tables, View References, and Common Table Expressions》

原文链接
本文为阿里云原创内容,未经容许不得转载。

退出移动版