本文首发于 2021-01-26 21:31:12

《ClickHouse和他的敌人们》系列文章转载自圈内好友 BohuTANG 的博客,原文链接:
https://bohutang.me/2021/01/2...
以下为注释。

在揭秘 ClickHouse Group By 之前,先聊聊数据库的性能比照测试问题。

在虎哥看来,一个“讲武德”的性能比照测试应该提供什么信息呢?

首先要尊重客观事实,在什么场景下,x 比 y 快?
其次是为什么 x 会比 y 快?

如果以上两条都做到了,还有一点也比拟重要: x 的劣势能够撑持多久? 是架构等带来的长期劣势,还是一袋烟的优化所得,是否能继续跟上本人的灵魂。

如果只是贴几个妖艳的数字,算不上是 benchmark,而是 benchmarket。

好了,回到 Group By 正题。

置信很多同学曾经体验到 ClickHouse Group By 的杰出性能,本篇就来剖析下快的起因。

首先刺激一下,ClickHouse 的 Group By 并没有应用高大上的黑科技,只是摸索了一条绝对较优的计划。

一条 SQL

SELECT sum(number) FROM numbers(10) GROUP BY number % 3

咱们就以这条简略的 SQL 作为线索,看看 ClickHouse 怎么实现 Group By 聚合。

1. 生成 AST

EXPLAIN ASTSELECT sum(number)FROM numbers(10)GROUP BY number % 3┌─explain─────────────────────────────────────┐│ SelectWithUnionQuery (children 1)           ││  ExpressionList (children 1)                ││   SelectQuery (children 3)                  ││    ExpressionList (children 1)              ││     Function sum (children 1)               │  // sum 聚合│      ExpressionList (children 1)            ││       Identifier number                     ││    TablesInSelectQuery (children 1)         ││     TablesInSelectQueryElement (children 1) ││      TableExpression (children 1)           ││       Function numbers (children 1)         ││        ExpressionList (children 1)          ││         Literal UInt64_10                   ││    ExpressionList (children 1)              ││     Function modulo (children 1)            │  // number % 3 函数│      ExpressionList (children 2)            ││       Identifier number                     ││       Literal UInt64_3                      │└─────────────────────────────────────────────┘

2. 生成 Query Plan

EXPLAINSELECT sum(number)FROM numbers(10)GROUP BY number % 3┌─explain───────────────────────────────────────────────────────────────────────┐│ Expression ((Projection + Before ORDER BY))                                   │ │   Aggregating                                                                 │ // sum 聚合│     Expression (Before GROUP BY)                                              │ // number % 3│       SettingQuotaAndLimits (Set limits and quota after reading from storage) ││         ReadFromStorage (SystemNumbers)                                       │└───────────────────────────────────────────────────────────────────────────────┘

代码次要在 InterpreterSelectQuery::executeImpl@Interpreters/InterpreterSelectQuery.cpp

3. 生成 Pipeline

EXPLAIN PIPELINESELECT sum(number)FROM numbers(10)GROUP BY number % 3┌─explain───────────────────────┐│ (Expression)                  ││ ExpressionTransform           ││   (Aggregating)               ││   AggregatingTransform        │  // sum 计算│     (Expression)              ││     ExpressionTransform       │  // number % 3 计算│       (SettingQuotaAndLimits) ││         (ReadFromStorage)     │└───────────────────────────────┘

4. 执行 Pipeline

Pipeline 是从底部往上逐个执行。

4.1 ReadFromStorage

首先从 ReadFromStorage 执行,生成一个 block1, 数据如下:

┌─number─┐│      0 ││      1 ││      2 ││      3 ││      4 ││      5 ││      6 ││      7 ││      8 ││      9 │└────────┘number类型为 UInt64

4.2 ExpressionTransform

ExpressionTransform 蕴含了 2 个 action:

  1. 名字为 number,type 为 INPUT
  2. 名字为 modulo(number, 3), type 为 FUNCTION

通过 ExpressionTransform 运行解决后生成一个新的 block2, 数据如下:

┌─number─┬─modulo(number, 3)─┐│      0 │                 0 ││      1 │                 1 ││      2 │                 2 ││      3 │                 0 ││      4 │                 1 ││      5 │                 2 ││      6 │                 0 ││      7 │                 1 ││      8 │                 2 ││      9 │                 0 │└────────┴───────────────────┘number 类型为 UInt64modulo(number, 3) 类型为 UInt8

代码次要在 ExpressionActions::execute@Interpreters/ExpressionActions.cpp

4.3 AggregatingTransform

AggregatingTransform 是 Group By 高性能的外围所在。
本示例中的 modulo(number, 3) 类型为 UInt8,在做优化上,ClickHouse 会抉择应用数组代替 hashtable作为分组,辨别逻辑见 Interpreters/Aggregator.cpp

在计算 sum 的时候,首先会生成一个数组 [1024],而后做了一个编译开展(代码 addBatchLookupTable8@AggregateFunctions/IAggregateFunction.h):

static constexpr size_t UNROLL_COUNT = 4;std::unique_ptr<Data[]> places{new Data[256 * UNROLL_COUNT]};bool has_data[256 * UNROLL_COUNT]{}; /// Separate flags array to avoid heavy initialization.size_t i = 0;/// Aggregate data into different lookup tables.size_t batch_size_unrolled = batch_size / UNROLL_COUNT * UNROLL_COUNT;for (; i < batch_size_unrolled; i += UNROLL_COUNT){    for (size_t j = 0; j < UNROLL_COUNT; ++j)    {        size_t idx = j * 256 + key[i + j];        if (unlikely(!has_data[idx]))        {            new (&places[idx]) Data;            has_data[idx] = true;        }        func.add(reinterpret_cast<char *>(&places[idx]), columns, i + j, nullptr);    }}

sum(number) … GROUP BY number % 3 计算形式:

array[0] = 0 + 3 + 6 + 9 = 18array[1] = 1 + 4 + 7 = 12array[2] = 2 + 5 + 8 = 15

这里只是针对 UInt8 做的一个优化分支,那么对于其余类型怎么优化解决呢?
ClickHouse 针对不同的类型别离提供了不同的 hashtable,声势比拟盛大(代码见 Aggregator.h):

using AggregatedDataWithUInt8Key = FixedImplicitZeroHashMapWithCalculatedSize<UInt8, AggregateDataPtr>;using AggregatedDataWithUInt16Key = FixedImplicitZeroHashMap<UInt16, AggregateDataPtr>;using AggregatedDataWithUInt32Key = HashMap<UInt32, AggregateDataPtr, HashCRC32<UInt32>>;using AggregatedDataWithUInt64Key = HashMap<UInt64, AggregateDataPtr, HashCRC32<UInt64>>;using AggregatedDataWithShortStringKey = StringHashMap<AggregateDataPtr>;using AggregatedDataWithStringKey = HashMapWithSavedHash<StringRef, AggregateDataPtr>;using AggregatedDataWithKeys128 = HashMap<UInt128, AggregateDataPtr, UInt128HashCRC32>;using AggregatedDataWithKeys256 = HashMap<DummyUInt256, AggregateDataPtr, UInt256HashCRC32>;using AggregatedDataWithUInt32KeyTwoLevel = TwoLevelHashMap<UInt32, AggregateDataPtr, HashCRC32<UInt32>>;using AggregatedDataWithUInt64KeyTwoLevel = TwoLevelHashMap<UInt64, AggregateDataPtr, HashCRC32<UInt64>>;using AggregatedDataWithShortStringKeyTwoLevel = TwoLevelStringHashMap<AggregateDataPtr>;using AggregatedDataWithStringKeyTwoLevel = TwoLevelHashMapWithSavedHash<StringRef, AggregateDataPtr>;using AggregatedDataWithKeys128TwoLevel = TwoLevelHashMap<UInt128, AggregateDataPtr, UInt128HashCRC32>;using AggregatedDataWithKeys256TwoLevel = TwoLevelHashMap<DummyUInt256, AggregateDataPtr, UInt256HashCRC32>;using AggregatedDataWithUInt64KeyHash64 = HashMap<UInt64, AggregateDataPtr, DefaultHash<UInt64>>;using AggregatedDataWithStringKeyHash64 = HashMapWithSavedHash<StringRef, AggregateDataPtr, StringRefHash64>;using AggregatedDataWithKeys128Hash64 = HashMap<UInt128, AggregateDataPtr, UInt128Hash>;using AggregatedDataWithKeys256Hash64 = HashMap<DummyUInt256, AggregateDataPtr, UInt256Hash>;

如果咱们改成 GROUP BY number*100000 后,它会抉择 AggregatedDataWithUInt64Key 的 hashtable 作为分组。

而且 ClickHouse 提供了一种 Two Level 形式,用语应答有大量分组 key 的状况,Level1 先分大组,Level2 小组能够并行计算。

针对 String 类型,依据不同的长度,hashtable 也做了很多优化,代码见 HashTable/StringHashMap.h

总结

ClickHouse 会依据 Group By 的最终类型,抉择一个最优的 hashtable 或数组,作为分组根底数据结构,使内存和计算尽量最优。

这个”最优解“是怎么找到的?从 test 代码能够看出,是不停的尝试、测试验证进去的,浓重的 bottom-up 哲学范。

hashtable 测试代码:Interpreters/tests

lookuptable 测试代码: tests/average.cpp


欢送关注我的微信公众号【数据库内核】:分享支流开源数据库和存储引擎相干技术。

题目网址
GitHubhttps://dbkernel.github.io
知乎https://www.zhihu.com/people/...
思否(SegmentFault)https://segmentfault.com/u/db...
掘金https://juejin.im/user/5e9d3e...
开源中国(oschina)https://my.oschina.net/dbkernel
博客园(cnblogs)https://www.cnblogs.com/dbkernel