原文地址:【MySQL 源码】UNION 比 UNION ALL 的性能差很多吗?
欢送拜访我的集体博客: http://blog.duhbb.com/
引言
本文从源码角度剖析了一下 MySQL 中 union 和 union all 的区别; 得出了以下论断: union 和 union all 都会创立长期表, 然而又不太一样; 二者的查问打算不一样;union 默认会创立一个以返回列作为 key 的长期表, 所谓过滤就是将数据插入这个长期表; 长期表装数据的容器实际上是一个 unordered_set; 有一种存储引擎叫做长期表; union all 则是间接读取表的数据并返回给客户端, 不走长期表; union all 和 union 的场景还是得依据须要来判断, 如果没有 distinct 的需要话, 数据又不多, 能够思考应用 union all.
Union 和 Union All 的区别
Union 和 Union All 之间的惟一区别是 Union All 不会删除反复的行或记录, 而是从所有表中抉择满足您的具体查问条件的所有行并将它们组合到后果表中.
UNION 不适用于具备文本数据类型的列. 而 UNION ALL 实用于所有数据类型列.
MySQL 官网介绍
MySQL 官网文档在介绍 12.5 Non-Subquery UNION Execution 是这么说的:
非子查问联结 (non-subquery unions) 是在 mysql_union()
的帮忙下实现的.
目前, 它分为以下步骤:
-
st_select_lex_unit::prepare
(对于对单个 SELECT 的派生表能够调用雷同的过程, 咱们在此过程中反对它, 但咱们不会在这里形容它):- 创立
select_union
(继承自select_result
), 将在此长期表中写入抉择后果, 长期表条目为空. 咱们将须要在这个对象存储在它下面的每个 JOIN 构造, 但咱们 (还没有) 长期表构造. - 调配 JOIN 构造并为每个 SELECT 执行 JOIN::prepare() 以获取无关 SELECT 列表元素类型 ( 后果) 的残缺信息. 在此循环中实现合并后果字段类型以及存储在非凡我的项目 (Item_type_holder) 中. 此操作的后果 (后果字段类型列表) 将存储在
st_select_lex_unit::types
中. - 创立一个长期表用于存储联结后果 (如果 UNION 没有 ALL 选项, ‘distinct’ 参数将传递给表创立过程).
- 为第一步中创立 select_union 的对象调配一个长期表 .
- 创立
-
st_select_lex_unit::exec
- 如果这不是第一次调用, 从长期表中删除行.
- 如果这是第一次调用, 则调用
JOIN::optimize
, 否则调用JOIN::reinit
, 而后为所有的 SELECT 调用JOIN::exec
(select_union 将为长期表写入后果). 如果联结是可缓存的并且这不是第一次调用, 则该办法将什么也不做. - 从所有 SELECT 收集后果后, 对长期表调用具备全局 ORDER BY 和 LIMIT 参数的 mysql_select. 为每个 UNION 创立的非凡的 fake_select_lex (SELECT_LEX) 将会传给个过程 (如果在查问中应用了括号, 那么 SELECT_LEX 也会存储全局的 ORDER BY 和 LIMIT 参数).
官网文档有点坑啊, 我都没有这两个办法: st_select_lex_unit::prepare
, st_select_lex_unit::exec
.
调试跟踪
在 sql_union.cc
943 行, 这个文件中有个办法:
void Query_expression::create_access_paths(THD *thd) {
// 确定咱们是否能够流式读取行, 即永远不须要将它们放到长期表中
// 如果能够的话, 咱们会首先物化 UNION DISTINCT blocks, 而后将残余任何 UNION ALL block
// 通过 AppendIterator 追加.
//
// 如果不能流式的话, 即每个块都必须进入长期表
// 咱们对于混合的 UNION ALL/DISTINCT 的策略有点不同
// 详情见 MaterializeIterator.
bool streaming_allowed = true;
if (global_parameters()->order_list.size() != 0) {
// If we're sorting, we currently put it in a real table no matter what.
// This is a legacy decision, because we used to not know whether filesort
// would want to refer to rows in the table after the sort (sort by row ID).
// We could probably be more intelligent here now.
streaming_allowed = false;
}
// 省略后面
// 如果容许流式查问, 那么咱们能够对 UNION ALL 的每个局部都做流式查问,
// 而其余状况则都须要用到长期表.
//
// 解决咱们须要物化的所有的 query block.
// 这个可能是 UNION DISTINCT 的 query block 或者所有的 block.
if (union_distinct != nullptr || ! streaming_allowed) {
Mem_root_array<MaterializePathParameters::QueryBlock> query_blocks =
setup_materialization(thd, tmp_table, streaming_allowed);
// 省略前面
光看代码感觉一头雾水, 还是 debug 一下吧.
if (! simple_query_expression) {
/*
Check that it was possible to aggregate all collations together for UNION.
We need this in case of UNION DISTINCT, to filter out duplicates using
the proper collation.
TODO: consider removing this test in case of UNION ALL.
*/
for (Item *type : types) {if (type->result_type() == STRING_RESULT &&
type->collation.derivation == DERIVATION_NONE) {my_error(ER_CANT_AGGREGATE_NCOLLATIONS, MYF(0), "UNION");
return true;
}
}
ulonglong create_options =
first_query_block()->active_options() | TMP_TABLE_ALL_COLUMNS;
if (union_result->create_result_table(thd, types, union_distinct != nullptr,
create_options, "", false,
instantiate_tmp_table))
这里执行的语句是:
select * from student union select * from student;
能够看到这里的确创立了长期表, 是在 sql_union.cc
这个文件的 prepare
办法中:
bool Query_expression::prepare(THD *thd, Query_result *sel_result,
创立长期表调用的是这个办法:
/**
Create a temp table according to a field list.
Given field pointers are changed to point at tmp_table for
send_result_set_metadata. The table object is self contained: it's
allocated in its own memory root, as well as Field objects
created for table columns. Those Field objects are common to TABLE and
TABLE_SHARE.
This function will replace Item_sum items in 'fields' list with
corresponding Item_field items, pointing at the fields in the
temporary table, unless save_sum_fields is set to false.
The Item_field objects are created in THD memory root.
@param thd thread handle
@param param a description used as input to create the table
@param fields list of items that will be used to define
column types of the table (also see NOTES)
@param group Group key to use for temporary table, NULL if
none
@param distinct should table rows be distinct
@param save_sum_fields see NOTES
@param select_options
@param rows_limit
@param table_alias possible name of the temporary table that can
be used for name resolving; can be "".
@remark mysql_create_view() checks that views have less than
MAX_FIELDS columns.
@remark We may actually end up with a table without any columns at all.
See comment below: We don't have to store this.
*/
#define STRING_TOTAL_LENGTH_TO_PACK_ROWS 128
#define AVG_STRING_LENGTH_TO_PACK_ROWS 64
#define RATIO_TO_PACK_ROWS 2
TABLE *create_tmp_table(THD *thd, Temp_table_param *param,
const mem_root_deque<Item *> &fields, ORDER *group,
bool distinct, bool save_sum_fields,
ulonglong select_options, ha_rows rows_limit,
const char *table_alias) {
其中有一段代码是这样的:
} else if (distinct && share->fields != param->hidden_field_count) {
/*
Create an unique key or an unique constraint over all columns
that should be in the result. In the temporary table, there are
'param->hidden_field_count' extra columns, whose null bits are stored
in the first 'hidden_null_pack_length' bytes of the row.
*/
DBUG_PRINT("info", ("hidden_field_count: %d", param->hidden_field_count));
share->keys = 1;
share->is_distinct = true;
if (! unique_constraint_via_hash_field) {
param->keyinfo->table = table;
param->keyinfo->is_visible = true;
param->keyinfo->user_defined_key_parts =
share->fields - param->hidden_field_count;
param->keyinfo->actual_key_parts = param->keyinfo->user_defined_key_parts;
KEY_PART_INFO *key_part_info = share->mem_root.ArrayAlloc<KEY_PART_INFO>(param->keyinfo->user_defined_key_parts);
if (key_part_info == nullptr) return nullptr;
param->keyinfo->key_part = key_part_info;
param->keyinfo->flags = HA_NOSAME | HA_NULL_ARE_EQUAL;
param->keyinfo->actual_flags = param->keyinfo->flags;
param->keyinfo->name = "<auto_distinct_key>";
// keyinfo->algorithm is set later, when storage engine is known
param->keyinfo->set_rec_per_key_array(nullptr, nullptr);
param->keyinfo->set_in_memory_estimate(IN_MEMORY_ESTIMATE_UNKNOWN);
/* 要害之处: 给咱们想要返回的列搞一个整体的 distinct key */
for (unsigned i = param->hidden_field_count; i < share->fields;
i++, key_part_info++) {key_part_info->init_from_field(table->field[i]);
if (key_part_info->store_length > max_key_part_length) {
unique_constraint_via_hash_field = true;
break;
}
}
table->key_info = param->keyinfo;
share->key_info = param->keyinfo;
share->key_parts = param->keyinfo->user_defined_key_parts;
}
}
感觉从这里仿佛了解了 MySQL 是如何做过滤的了, 它会创立一个长期表, 而后给要返回的字段建一个 distinct key
, 如此一来长期表也会有索引咯? 而后再插入这个表的时候判断是否曾经有雷同的列了? 刮目相待吧!
果然, 调试的时候跟踪到了这里:
Hash_unique::Hash_unique(const Table &table, const KEY &mysql_index,
const Allocator<Indexed_cells> &allocator)
: Index(table, mysql_index),
m_hash_table(INDEX_DEFAULT_HASH_TABLE_BUCKETS, Indexed_cells_hash(*this),
Indexed_cells_equal_to(*this), allocator) {}
Result Hash_unique::insert(const Indexed_cells &indexed_cells,
Cursor *insert_position) {
std::pair<Container::iterator, bool> r;
try {
// m_hash_table 就是个 unordered_set
r = m_hash_table.emplace(indexed_cells);
} catch (Result ex) {return ex;}
auto &pos = r.first;
const bool new_element_inserted = r.second;
// 就是这里了, 判断是否插入胜利
if (! new_element_inserted) {return Result::FOUND_DUPP_KEY;}
*insert_position = Cursor(pos);
return Result::OK;
}
下面这个代码的门路是在: /Users/tuhooo/mysql-server/storage/temptable/src/index.cc
贴个图片留念一下:
它喵的, 这个长期表保留记录其实就是用了一个 unordered_set
, 笑哭!
/**
* @brief Attempts to build and insert an element into the
* %unordered_set.
* @param __args Arguments used to generate an element.
* @return A pair, of which the first element is an iterator that points
* to the possibly inserted element, and the second is a bool
* that is true if the element was actually inserted.
*
* This function attempts to build and insert an element into the
* %unordered_set. An %unordered_set relies on unique keys and thus an
* element is only inserted if it is not already present in the
* %unordered_set.
*
* Insertion requires amortized constant time.
*/
template<typename... _Args>
std::pair<iterator, bool>
emplace(_Args&&... __args)
{return _M_h.emplace(std::forward<_Args>(__args)...); }
长期表其实是一种存储引擎.
实质上还是创立的查问打算不一样, 就是在下边这段代码了:
m_root_iterator = CreateIteratorFromAccessPath(thd, m_root_access_path, join, /*eligible_for_batch_mode=*/true);
if (m_root_iterator == nullptr) {return true;}
它来自于: sql_union.cc 中的 optimize 办法:
bool Query_expression::optimize(THD *thd, TABLE *materialize_destination,
bool create_iterators,
bool finalize_access_paths) {
查问打算
union all
mysql> explain select * from student union all select * from student \G;
*************************** 1. row ***************************
id: 1
select_type: PRIMARY
table: student
partitions: NULL
type: ALL
possible_keys: NULL
key: NULL
key_len: NULL
ref: NULL
rows: 1
filtered: 100.00
Extra: NULL
*************************** 2. row ***************************
id: 2
select_type: UNION
table: student
partitions: NULL
type: ALL
possible_keys: NULL
key: NULL
key_len: NULL
ref: NULL
rows: 1
filtered: 100.00
Extra: NULL
2 rows in set, 1 warning (6.53 sec)
ERROR:
No query specified
mysql> show warnings;
+-------+------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Level | Code | Message |
+-------+------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Note | 1003 | /* select#1 */ select `test`.`student`.`id` AS `id`,`test`.`student`.`name` AS `name`,`test`.`student`.`phone` AS `phone` from `test`.`student` union all /* select#2 */ select `test`.`student`.`id` AS `id`,`test`.`student`.`name` AS `name`,`test`.`student`.`phone` AS `phone` from `test`.`student` |
+-------+------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
1 row in set (3.24 sec)
union
mysql> explain select * from student union select * from student \G;
*************************** 1. row ***************************
id: 1
select_type: PRIMARY
table: student
partitions: NULL
type: ALL
possible_keys: NULL
key: NULL
key_len: NULL
ref: NULL
rows: 1
filtered: 100.00
Extra: NULL
*************************** 2. row ***************************
id: 2
select_type: UNION
table: student
partitions: NULL
type: ALL
possible_keys: NULL
key: NULL
key_len: NULL
ref: NULL
rows: 1
filtered: 100.00
Extra: NULL
*************************** 3. row ***************************
id: NULL
select_type: UNION RESULT
table: <union1,2>
partitions: NULL
type: ALL
possible_keys: NULL
key: NULL
key_len: NULL
ref: NULL
rows: NULL
filtered: NULL
Extra: Using temporary
3 rows in set, 1 warning (7.19 sec)
ERROR:
No query specified
mysql> show warnings;
+-------+------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Level | Code | Message |
+-------+------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Note | 1003 | /* select#1 */ select `test`.`student`.`id` AS `id`,`test`.`student`.`name` AS `name`,`test`.`student`.`phone` AS `phone` from `test`.`student` union /* select#2 */ select `test`.`student`.`id` AS `id`,`test`.`student`.`name` AS `name`,`test`.`student`.`phone` AS `phone` from `test`.`student` |
+-------+------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
1 row in set (4.04 sec)
嘿嘿, 刚好又把我之前学的一点点 explain
常识给串起来了.
Extra: Using temporary
示意应用了长期表.
Using temporary
为了解析查问, MySQL 须要创立一个长期表来保留后果. 如果查问蕴含以不同形式列出列的 GROUP BY 和 ORDER BY 子句, 通常会产生这种状况.
如果对于查问打算不相熟的, 能够参考我翻译和整顿的这篇博客:【MySQL 文档翻译】了解查问打算
总结
- union 和 union all 都会创立长期表, 然而又不太一样
- 二者的查问打算不一样
- union 默认会创立一个以返回列作为 key 的长期表, 所谓过滤就是将数据插入这个长期表
- 长期表装数据的容器实际上是一个 unordered_set
- 有一种存储引擎叫做长期表
- union all 则是间接读取表的数据并返回给客户端, 不走长期表
- union all 和 union 的场景还是得依据须要来判断, 如果没有 distinct 的需要话, 数据又不多, 能够思考应用 union all
原文地址:【MySQL 源码】UNION 比 UNION ALL 的性能差很多吗?
欢送拜访我的集体博客: http://blog.duhbb.com/