On Columnar Databases: A ClickHouse MergeTree Performance Test


ClickHouse Performance Testing

To evaluate ClickHouse's performance, this post runs a series of tests modeled on real business scenarios.

Generating Test Data

A very common scenario in real business systems involves two tables: an order main table and an order detail table.
The two tables are usually queried with a join or aggregated with a group by, so the tests below put ClickHouse through exactly this workload.

Defining the Table Schemas

test_order: the main table.
Schema:

CREATE TABLE `test_order` (
  `id` bigint(11) NOT NULL AUTO_INCREMENT,
  `field_name_1` varchar(60) NOT NULL,
  `field_name_2` varchar(60) NOT NULL,
  `field_name_3` varchar(60) NOT NULL,
  `field_name_4` varchar(60) NOT NULL,
  `field_name_5` varchar(60) NOT NULL,
  `field_name_6` varchar(60) NOT NULL,
  `field_name_7` varchar(60) NOT NULL,
  `field_name_8` varchar(60) NOT NULL,
  `field_name_9` varchar(60) NOT NULL,
  `field_name_10` varchar(60) NOT NULL,
  `field_id_1` int(11) NOT NULL,
  `field_id_2` int(11) NOT NULL,
  `field_id_3` int(11) NOT NULL,
  `field_id_4` int(11) NOT NULL,
  `field_id_5` int(11) NOT NULL,
  `field_id_6` int(11) NOT NULL,
  `field_id_7` int(11) NOT NULL,
  `field_id_8` int(11) NOT NULL,
  `field_id_9` int(11) NOT NULL,
  `field_id_10` int(11) NOT NULL,
  `field_date_1` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,
  `field_date_2` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,
  `field_date_3` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,
  `field_date_4` datetime NOT NULL DEFAULT CURRENT_TIMESTAMP,
  `field_date_5` datetime NOT NULL DEFAULT CURRENT_TIMESTAMP,
  `field_date_6` datetime NOT NULL DEFAULT CURRENT_TIMESTAMP,
  `field_date_7` datetime NOT NULL DEFAULT CURRENT_TIMESTAMP,
  `field_date_8` datetime NOT NULL DEFAULT CURRENT_TIMESTAMP,
  `field_date_9` datetime NOT NULL DEFAULT CURRENT_TIMESTAMP,
  PRIMARY KEY (`id`),
  KEY `idx_field_1` (`field_name_1`,`field_id_1`) USING BTREE
) ENGINE=InnoDB AUTO_INCREMENT=1043 DEFAULT CHARSET=utf8mb4;

test_order_detail: the detail table. To make the test queries more demanding, it is defined with 41 columns.
Schema:

CREATE TABLE `test_order_detail` (
  `id` bigint(11) NOT NULL AUTO_INCREMENT,
  `order_id` bigint(11) NOT NULL,
  `field_name_1` varchar(60) NOT NULL,
  `field_name_2` varchar(60) NOT NULL,
  `field_name_3` varchar(60) NOT NULL,
  `field_name_4` varchar(60) NOT NULL,
  `field_name_5` varchar(60) NOT NULL,
  `field_name_6` varchar(60) NOT NULL,
  `field_name_7` varchar(60) NOT NULL,
  `field_name_8` varchar(60) NOT NULL,
  `field_name_9` varchar(60) NOT NULL,
  `field_name_10` varchar(60) NOT NULL,
  `field_name_11` varchar(60) NOT NULL,
  `field_name_12` varchar(60) NOT NULL,
  `field_name_13` varchar(60) NOT NULL,
  `field_name_14` varchar(60) NOT NULL,
  `field_name_15` varchar(60) NOT NULL,
  `field_name_16` varchar(60) NOT NULL,
  `field_name_17` varchar(60) NOT NULL,
  `field_name_18` varchar(60) NOT NULL,
  `field_name_19` varchar(60) NOT NULL,
  `field_name_20` varchar(60) NOT NULL,
  `field_id_1` int(11) NOT NULL,
  `field_id_2` int(11) NOT NULL,
  `field_id_3` int(11) NOT NULL,
  `field_id_4` int(11) NOT NULL,
  `field_id_5` int(11) NOT NULL,
  `field_id_6` int(11) NOT NULL,
  `field_id_7` int(11) NOT NULL,
  `field_id_8` int(11) NOT NULL,
  `field_id_9` int(11) NOT NULL,
  `field_id_10` int(11) NOT NULL,
  `field_date_1` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,
  `field_date_2` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,
  `field_date_3` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,
  `field_date_4` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,
  `field_date_5` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,
  `field_date_6` datetime NOT NULL DEFAULT CURRENT_TIMESTAMP,
  `field_date_7` datetime NOT NULL DEFAULT CURRENT_TIMESTAMP,
  `field_date_8` datetime NOT NULL DEFAULT CURRENT_TIMESTAMP,
  `field_date_9` datetime NOT NULL DEFAULT CURRENT_TIMESTAMP,
  PRIMARY KEY (`id`),
  KEY `idx_order_id` (`order_id`) USING BTREE
) ENGINE=InnoDB AUTO_INCREMENT=18129081 DEFAULT CHARSET=utf8mb4;

Writing the Test Data into MySQL

test_order is the main table; 1,024 rows are inserted into it.

The test_order_detail table is the centerpiece: 18 million rows are written into it in batches, with every column generated by a random function. The generation code is simple, so it is not shown in full; the sketch below gives the idea.
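
The idea, as a minimal MySQL sketch (the actual generation script is not shown above, and the column list here is abbreviated for illustration):

-- One generated row; the remaining field_name_*/field_id_*/field_date_*
-- columns are filled with the same random-value pattern.
INSERT INTO test_order_detail
    (order_id, field_name_1, field_id_1, field_date_1 /* , ... all other columns */)
VALUES
    (FLOOR(1 + RAND() * 1024),                      -- random parent order id
     SUBSTRING(MD5(RAND()), 1, 20),                 -- random 20-char string
     FLOOR(RAND() * 100000),                        -- random integer
     NOW() - INTERVAL FLOOR(RAND() * 86400) SECOND  -- random time in the last day
    );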

In the MySQL data directory, the .ibd file contains test_order_detail's data and indexes. It has already reached 13 GB, which is a substantial amount of data:

-rw-r-----@ 1 jiao  staff    14K  8 15 12:46 test_order_detail.frm
-rw-r-----@ 1 jiao  staff    13G  8 16 20:30 test_order_detail.ibd

Exporting Data from MySQL to .csv Files

This takes advantage of ClickHouse's ability to read CSV files directly and insert their contents into a table.
100,000 rows are read from MySQL at a time and written to one CSV file each,
producing 180-odd .csv files (one possible export statement is sketched after the listing):

➜  csv ll
total 29852872
-rw-r--r--  1 jiao  staff    71M  8 21 18:10 1.csv
-rw-r--r--  1 jiao  staff    74M  8 21 18:10 10.csv
-rw-r--r--  1 jiao  staff    78M  8 21 18:15 100.csv
-rw-r--r--  1 jiao  staff    78M  8 21 18:15 101.csv
-rw-r--r--  1 jiao  staff    78M  8 21 18:15 102.csv
-rw-r--r--  1 jiao  staff    78M  8 21 18:15 103.csv
-rw-r--r--  1 jiao  staff    78M  8 21 18:15 104.csv
-rw-r--r--  1 jiao  staff    78M  8 21 18:16 105.csv
-rw-r--r--  1 jiao  staff    78M  8 21 18:16 106.csv
-rw-r--r--  1 jiao  staff    78M  8 21 18:16 107.csv
-rw-r--r--  1 jiao  staff    78M  8 21 18:16 108.csv
-rw-r--r--  1 jiao  staff    78M  8 21 18:16 109.csv
-rw-r--r--  1 jiao  staff    75M  8 21 18:10 11.csv
-rw-r--r--  1 jiao  staff    78M  8 21 18:16 110.csv
-rw-r--r--  1 jiao  staff    78M  8 21 18:16 111.csv
-rw-r--r--  1 jiao  staff    78M  8 21 18:16 112.csv
-rw-r--r--  1 jiao  staff    78M  8 21 18:16 113.csv
-rw-r--r--  1 jiao  staff    78M  8 21 18:16 114.csv
-rw-r--r--  1 jiao  staff    78M  8 21 18:16 115.csv
-rw-r--r--  1 jiao  staff    78M  8 21 18:16 116.csv
-rw-r--r--  1 jiao  staff    78M  8 21 18:16 117.csv
-rw-r--r--  1 jiao  staff    78M  8 21 18:16 118.csv
-rw-r--r--  1 jiao  staff    78M  8 21 18:17 119.csv
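
For reference, each slice could be produced with MySQL's SELECT ... INTO OUTFILE; this is a sketch rather than the script actually used, and the output path is illustrative (it must sit under secure_file_priv):

-- Export one 100,000-row slice; shift the offset by 100,000 per file
-- to produce 2.csv, 3.csv, and so on.
SELECT *
INTO OUTFILE '/tmp/csv/1.csv'
FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
LINES TERMINATED BY '\n'
FROM test_order_detail
ORDER BY id
LIMIT 0, 100000;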

Inserting the CSV Files into ClickHouse with PHP

Install the third-party PHP package for ClickHouse: https://github.com/smi2/phpClickHouse
The package talks to ClickHouse over the HTTP protocol.
First, create the table in ClickHouse:

CREATE TABLE test.test_order_detail
(
    `id` Int64,
    `order_id` Int64,
    `field_name_1` String,
    `field_name_2` String,
    `field_name_3` String,
    `field_name_4` String,
    `field_name_5` String,
    `field_name_6` String,
    `field_name_7` String,
    `field_name_8` String,
    `field_name_9` String,
    `field_name_10` String,
    `field_name_11` String,
    `field_name_12` String,
    `field_name_13` String,
    `field_name_14` String,
    `field_name_15` String,
    `field_name_16` String,
    `field_name_17` String,
    `field_name_18` String,
    `field_name_19` String,
    `field_name_20` String,
    `field_id_1` Int64,
    `field_id_2` Int64,
    `field_id_3` Int64,
    `field_id_4` Int64,
    `field_id_5` Int64,
    `field_id_6` Int64,
    `field_id_7` Int64,
    `field_id_8` Int64,
    `field_id_9` Int64,
    `field_id_10` Int64,
    `field_date_1` DateTime,
    `field_date_2` DateTime,
    `field_date_3` DateTime,
    `field_date_4` DateTime,
    `field_date_5` DateTime,
    `field_date_6` DateTime,
    `field_date_7` DateTime,
    `field_date_8` DateTime,
    `field_date_9` DateTime
)
ENGINE = MergeTree
ORDER BY id
SETTINGS index_granularity = 8192

Then execute the PHP script. The code is straightforward; the relevant part is shown below:

<?php
use ClickHouseDB\Client;

$begin  = microtime(true);
$config = [
    'host'     => '172.16.101.134',
    'port'     => '8123',
    'username' => 'caps',
    'password' => '123456'
];
$db = new Client($config);
$db->database('test');
$db->setTimeout(60);        // request timeout: 60 seconds
$db->setConnectTimeOut(50); // connect timeout: 50 seconds
// $tables = $db->showTables();

// insert from csv (loop bounds are set to 1 here; raise them to load all files)
$connect = microtime(true);
for ($j = 1; $j <= 1; $j++) {
    $file_data_names = [];
    for ($i = 1; $i <= 1; $i++) {
        $file_data_names[] = __DIR__ . DIRECTORY_SEPARATOR . 'csv' . DIRECTORY_SEPARATOR . $j . '.csv';
    }
    $db->insertBatchFiles('test_order_detail_tmp', $file_data_names);
    usleep(1000); // brief pause between batches
}
echo microtime(true) - $begin . PHP_EOL;
echo microtime(true) - $connect . PHP_EOL;

Insert Performance

The table has no partition key defined. Every row is randomly generated across its 41 columns, at roughly 0.8 KB per row.

Rows per batch    Time     Data volume
1,000             0.05s    0.7 MB
10,000            0.25s    7.1 MB
50,000            1.0s     36 MB
100,000           2.0s     73 MB
200,000           3.6s     146 MB

Results can differ a great deal between machines. Going by this machine's numbers, batches of 1,000 to 50,000 rows per insert are the sweet spot and reliably complete within one second.

Errors that may come up when inserting:
1. If a partition key is set and a single insert would produce too many partitions, the insert fails; the default cap is 100 partitions (see the setting below).
2. Inserting too much data in one batch can run out of memory.
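
The cap in point 1 corresponds to ClickHouse's max_partitions_per_insert_block setting (default 100). If an insert that spans many partitions is intentional, the limit can be raised, for example:

-- Raise the per-insert partition limit for the current session.
SET max_partitions_per_insert_block = 500;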

Compression Ratio

For the same 18 million rows:
MySQL storage used: 13 GB
ClickHouse storage used: 4.1 GB

Given that every column holds random values, a compression ratio of more than 3x is already very good, and the LZ4 codec also decompresses extremely fast.
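
The on-disk figures can also be read out of ClickHouse itself via the system.parts system table; a query along these lines reports compressed versus uncompressed bytes for the table's active parts:

SELECT
    formatReadableSize(sum(data_compressed_bytes))   AS compressed,
    formatReadableSize(sum(data_uncompressed_bytes)) AS uncompressed
FROM system.parts
WHERE database = 'test' AND table = 'test_order_detail' AND active;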

Query Performance

test_order_detail: 18 million rows
test_order: about 1,000 rows
The tests below run the kinds of SQL that come up most often in business code.

Test1

select count(*) from test.test_order_detail

Counting all rows: about as common as SQL gets. ClickHouse stores each data part's row count in a count.txt file, so the answer really does come back almost instantly.

MySQL time    ClickHouse time
20s           0.003s

ClickHouse query result:

1 rows in set. Elapsed: 0.003 sec. 
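
That part-level metadata is directly visible too: summing the rows column of system.parts over active parts yields the same total without reading any column data, which is essentially what this count(*) does:

SELECT sum(rows)
FROM system.parts
WHERE database = 'test' AND table = 'test_order_detail' AND active;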

Test2

select a.order_id,sum(a.field_id_1),sum(a.field_id_2) from test.test_order_detail as a join test.test_order as b on a.order_id = b.id group by a.order_id;

Aggregating over a join: at this data volume MySQL can no longer keep up.

MySQL time        ClickHouse time
did not finish    0.450s

ClickHouse query result: with no usable index, the query scanned the whole table and processed all 18.13 million rows. A throughput of over 40 million rows per second is impressively efficient.

1042 rows in set. Elapsed: 0.450 sec. Processed 18.13 million rows, 435.11 MB (40.28 million rows/s., 966.66 MB/s.) 

Test3

select a.order_id,sum(a.field_id_1),sum(a.field_id_2) from test.test_order_detail as a join test.test_order as b on a.order_id = b.id group by a.order_id limit 1,20;

Adding a LIMIT: after a long wait, MySQL still had not returned a result.

MySQL time        ClickHouse time
did not finish    0.574s

ClickHouse query result:

20 rows in set. Elapsed: 0.574 sec. Processed 18.13 million rows, 435.11 MB (31.60 million rows/s., 758.37 MB/s.) 

Test4

select order_id,sum(field_id_1),sum(field_id_2) from test.test_order_detail group by order_id limit 1,20;

Single-table aggregation: once again MySQL had not returned a result after a long wait.

MySQL time        ClickHouse time
did not finish    0.212s

ClickHouse query result:

20 rows in set. Elapsed: 0.212 sec. Processed 18.13 million rows, 435.10 MB (85.63 million rows/s., 2.06 GB/s.) 

Summary

When data volumes are small and the SQL is simple, MySQL is still very convenient, but in big-data scenarios it quickly falls behind. Even the simple tests above show that ClickHouse is an excellent fit for querying large datasets: its columnar storage and data compression let it process data very efficiently. Beyond that, SummingMergeTree and AggregatingMergeTree can pre-aggregate data for even better performance; I will share more on those when time permits.
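
As a taste of that, here is a minimal sketch of pre-aggregation with SummingMergeTree; the test_order_detail_sum table is a made-up name for illustration, with columns borrowed from this test's schema:

-- Rows sharing the same ORDER BY key (order_id) are summed together during
-- background merges, shrinking ~18 million rows toward ~1,000.
CREATE TABLE test.test_order_detail_sum
(
    `order_id`   Int64,
    `field_id_1` Int64,
    `field_id_2` Int64
)
ENGINE = SummingMergeTree
ORDER BY order_id;

INSERT INTO test.test_order_detail_sum
SELECT order_id, sum(field_id_1), sum(field_id_2)
FROM test.test_order_detail
GROUP BY order_id;

-- Merges run asynchronously, so exact totals still need a GROUP BY at query
-- time, but it now reads the pre-summed rows instead of the raw table.
SELECT order_id, sum(field_id_1), sum(field_id_2)
FROM test.test_order_detail_sum
GROUP BY order_id;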
