关于join:技术分享-咬文嚼字之驱动表-outer表

31次阅读

共计 4957 个字符,预计需要花费 13 分钟才能阅读完成。

作者:胡呈清

爱可生 DBA 团队成员,善于故障剖析、性能优化,集体博客:https://www.jianshu.com/u/a95…,欢送探讨。

本文起源:原创投稿

* 爱可生开源社区出品,原创内容未经受权不得随便应用,转载请分割小编并注明起源。


什么是驱动表?

什么是 outer 表和 inner 表?

outer 表等同于驱动表吗?

在 MySQL 中这个问题的脉络

  1. MySQL 的 join 算法:Nested-Loop 和其变种算法 Block Nested-Loop

MySQL8.0.18 引入的 hash join 其实能够算是对 Block Nested-Loop 的优化(MySQL8.0.20 之前 explain 输入显示的还是 Block Nested Loop,须要用 explain format=tree 能力看到应用了 hash join),直到 MySQL8.0.20 删除了 Block Nested-Loop,hash join 正式上位。

  1. Nested-Loop 算法:外循环和内循环

t1、t2 两表关联时,最简略的 Nested-Loop 的算法如下:

for each row in t1 matching range {
    for each row in t2 {if row satisfies join conditions, send to client}
}

这个算法的意思就是:每次将一行数据从外循环传递到内循环进行比照。而外循环中的表就叫 outer 表,内循环中的表就是 inner 表。

在 MySQL 文档中没有任何对于驱动表 (drving table) 的形容和定义,不过咱们能够参考 Oracle 的这个帖子:https://asktom.oracle.com/pls…

The 'driving' table is the table we will join FROM -- that is JOIN TO other tables.

这意思多少有点形象了,不过别急,咱们再推敲下下面的 Nested-Loop 算法,不就是 outer 表“驱动”inner 表的意思吗?所以啊,“驱动表”其实就是 outer 表。

  1. Block Nested-Loop(BNL)的由来

依照 Nested-Loop 算法,如果 inner 表的关联字段有索引,则在内循环中 inner 表能够利用索引查找数据,查找次数等于 outer 表行数(满足其余条件)。然而如果 inner 表的关联字段没有索引,则每次 inner 表都须要全表扫描,为了缩小 inner 表的全表扫描次数,每次从 outer 表中会取出多行数据寄存到 join buffer 中,并把 join buffer 传递到内循环中,则能够将内循环 inner 表中读取的每一行与 join buffer 中的所有行进行比拟。这样 inner 表的读取次数显著缩小,如果 join buffer 可能放下 outer 表的所有行,则 inner 表只须要读取一次(一次全表扫描)。

留神:放进内存 (join buffer) 的是驱动表。

  1. Hash Join 的由来

BNL 算法在 join buffer 中保护的是一个无序数组,所以每次在 join buffer 中查找都要遍历所有行。不难看出 BNL 算法比照 Nested-Loop 缩小的是 IO 老本,CPU 老本并没有升高,比照次数都等于 outer、inner 表行数的乘积。hash join 就是在此算法的根底上,在 join buffer 中保护一个哈希表,每次查找做一次判断就能找到数据,这样一来 CPU 老本也显著升高。

  1. outer 表、驱动表的抉择

对于 left join、right join 来说,其语义曾经固定了 outer 表的抉择,没啥探讨空间(除非 where 子句中突破了其语义)。对于 inner join,outer 表的抉择是由优化器说了算的,举例:
select * from t1 join t2 on t1.a=t2.a;

a. 如果 t1.a、t2.a 都有索引,且基数高,则效率最高的算法是 Nested-Loop,因为有索引,通常咱们会改称其为 Index Nested-Loop,则会抉择小表作为 outer 表,这样循环的次数会更少;

b. 如果 t1.a、t2.a 都没有索引,基于老本的思考,则优化器会抉择 BNL 算法或者 hash join,因为 outer 表要放入 join buffer 中,而这块内存的大小是依据 join_buffer_size 参数指定的,容量无限,所以还是会抉择小表作为 outer 表;

c. 当然也不肯定都是抉择小表作为 outer 表,如果 t1 表有 10 万行数据,t1.a 有索引;而 t2 表有 20 万行数据,t2.a 没有索引。则优化器很可能抉择 t2 表作为 outer 表,因为 Index Nested-Loop 算法必定比 BNL 算法老本更低,也可能比 hash join 算法成本低。

例子比较简单,理论状况会更简单,比方 SQL 中多半还会有 where 子句,这时候小表的定义就不是 t1、t2 的整表大小了,而是 t1、t2 利用完 where 子句后的数据大小,本篇不做过多探讨。

其余数据库

从上文能够看出,outer 表脱胎于“外循环”,而外循环严格来说是 Nested-Loop 算法中的定义。翻阅多个数据库的文档(见下文),其实在形容其余 join 算法时(Hash Join、Merge Join)都没有呈现“outer table”,所以不禁会产生一种疑难:如果不是 Nested-Loop 算法,会有 outer 表的说法吗?

但从上文也能够看出,其实 Hash Join 实质上还是一种“循环连贯”算法,包含 MySQL 没有实现的 Merge Join 算法也一样,所以我个人观点是:

  • 在 Join 查问中,数据库扫描第一个表为驱动表,同时也是 outer table。

informix 对于 outer table 的形容

见链接:https://www.ibm.com/docs/sr/i…

In a nested-loop join, the database server scans the first, or outer table, and then joins each of the rows that pass table filters to the rows found in the second, or inner table.

sybase 对于 outer table 的形容

见链接:https://infocenter.sybase.com…

  • Inner and outer tables
    The terms outer table and inner table describe the placement of the tables in an outer join:
  • In a left join, the outer table and inner table are the left and right tables respectively. The outer table and inner table are also referred to as the row-preserving and null-supplying tables, respectively.
    In a right join, the outer table and inner table are the right and left tables, respectively.

Oracle 对于 outer table 的形容

How Nested Loops Joins Work 章节

The inner loop is executed for every row of the outer loop. The employees table is the “outer” data set because it is in the exterior forloop. The outer table is sometimes called a driving table. Thedepartments table is the “inner” data set because it is in the interior for loop.

A nested loops join involves the following basic steps:

  1. The optimizer determines the driving row source and designates it as the outer loop.

The outer loop produces a set of rows for driving the join condition. The row source can be a table accessed using an index scan, a full table scan, or any other operation that generates rows.

The number of iterations of the inner loop depends on the number of rows retrieved in the outer loop. For example, if 10 rows are retrieved from the outer table, then the database must perform 10 lookups in the inner table. If 10,000,000 rows are retrieved from the outer table, then the database must perform 10,000,000 lookups in the inner table.

outer join 章节:

In ANSI syntax, the OUTER JOIN clause specifies an outer join. In the FROM clause, the left table appears to the left of the OUTER JOIN keywords, and the right table appears to the right of these keywords. The left table is also called the outer table, and the right table is also called the inner table.

Nested Loops Outer Joins 章节:

An outer join returns all rows that satisfy the join condition and also rows from one table for which no rows from the other table satisfy the condition. Thus, the result set of an outer join is the superset of an inner join.

In ANSI syntax, the OUTER JOIN clause specifies an outer join. In the FROM clause, the left table appears to the left of the OUTER JOIN keywords, and the right table appears to the right of these keywords. The left table is also called the outer table, and the right table is also called the inner table. For example, in the following statement the employees table is the left or outer table:

Outer joins require the outer-joined table to be the driving table. In the preceding example, employees is the driving table, and departments is the driven-to table.

Hash Join Outer Joins 章节:

The optimizer uses hash joins for processing an outer join when either the data volume is large enough to make a hash join efficient, or it is impossible to drive from the outer table to the inner table.

The cost determines the order of tables. The outer table, including preserved rows, may be used to build the hash table, or it may be used to probe the hash table.

正文完
 0