关于大数据处理:基于-Databend-和腾讯云-COS-打造新型云数仓

3次阅读

共计 33740 个字符,预计需要花费 85 分钟才能阅读完成。

本篇文章向大家演示如何应用 Databend 基于腾讯云 COS 构建旧式数仓及其计算能力。如果你也在找一个低成本、高性能、反对弹性的数仓,Databend 能够为大家提供一个基于对象存储的云原生数仓解决方案。目前 Databend 反对数据的 stream load , copy into from stage , insert 等形式的数据写入,部署上反对单机和集群模式。须要更多反对增加微信: 82565387。文章较长,倡议珍藏 PC 端浏览。

Databend 介绍

Databend 是一款应用 Rust 研发、开源、齐全面向对象存储架构的旧式数仓,提供极速的弹性扩大能力,致力于打造按需、按量的 Data Cloud 产品体验。具备以下特点:
•Vectorized Execution 和 Pull&Push-Based Processor Model

•真正的存储、计算拆散架构,高性能、低成本,按需按量应用

•残缺的数据库反对,兼容 MySQL,Clickhouse 协定,SQL Over http

•欠缺的事务性,反对 Data Time Travel, Database Zero Clone 等性能

•反对基于同一份数据的多租户读写、共享操作

github repo: https://github.com/datafusela…
Docs : https://databend.rs
对于 Databend 架构图,参考:https://databend.rs/doc/

腾讯云 COS

对象存储(Cloud Object Storage,COS)是由腾讯云推出的无目录层次结构、无数据格式限度,可包容海量数据且反对 HTTP/HTTPS 协定拜访的分布式存储服务。腾讯云 COS 的存储桶空间无容量下限,无需分区治理,实用于 CDN 数据散发、数据万象解决或大数据计算与剖析的数据湖等多种场景。
官网:https://cloud.tencent.com/pro…

测试环境介绍

北京区:CVM SA2.8XLARGE64 & COS(ap-beijing)
操作系统:ubuntu-20
Databend:应用进二制公布版本 v0.6.99-nightly
下载地址:https://repo.databend.rs/data…
本次测试装置部署形式参考:https://databend.rs/doc/deplo…
集群部署模式参考:https://databend.rs/doc/deplo…

测试数据

wget --no-check-certificate --continue https://transtats.bts.gov/PREZIP/\
On_Time_Reporting_Carrier_On_Time_Performance_1987_present_{1987..2021}_{1..12}.zip

表构造参考:cat create_ontime.sql

CREATE TABLE ontime
(
    Year                            UInt16 NOT NULL,
    Quarter                         UInt8 NOT NULL,
    Month                           UInt8 NOT NULL,
    DayofMonth                      UInt8 NOT NULL,
    DayOfWeek                       UInt8 NOT NULL,
    FlightDate                      Date NOT NULL,
    Reporting_Airline               String NOT NULL,
    DOT_ID_Reporting_Airline        Int32 NOT NULL,
    IATA_CODE_Reporting_Airline     String NOT NULL,
    Tail_Number                     String NOT NULL,
    Flight_Number_Reporting_Airline String NOT NULL,
    OriginAirportID                 Int32 NOT NULL,
    OriginAirportSeqID              Int32 NOT NULL,
    OriginCityMarketID              Int32 NOT NULL,
    Origin                          String NOT NULL,
    OriginCityName                  String NOT NULL,
    OriginState                     String NOT NULL,
    OriginStateFips                 String NOT NULL,
    OriginStateName                 String NOT NULL,
    OriginWac                       Int32 NOT NULL,
    DestAirportID                   Int32 NOT NULL,
    DestAirportSeqID                Int32 NOT NULL,
    DestCityMarketID                Int32 NOT NULL,
    Dest                            String NOT NULL,
    DestCityName                    String NOT NULL,
    DestState                       String NOT NULL,
    DestStateFips                   String NOT NULL,
    DestStateName                   String NOT NULL,
    DestWac                         Int32 NOT NULL,
    CRSDepTime                      Int32 NOT NULL,
    DepTime                         Int32 NOT NULL,
    DepDelay                        Int32 NOT NULL,
    DepDelayMinutes                 Int32 NOT NULL,
    DepDel15                        Int32 NOT NULL,
    DepartureDelayGroups            String NOT NULL,
    DepTimeBlk                      String NOT NULL,
    TaxiOut                         Int32 NOT NULL,
    WheelsOff                       Int32 NOT NULL,
    WheelsOn                        Int32 NOT NULL,
    TaxiIn                          Int32 NOT NULL,
    CRSArrTime                      Int32 NOT NULL,
    ArrTime                         Int32 NOT NULL,
    ArrDelay                        Int32 NOT NULL,
    ArrDelayMinutes                 Int32 NOT NULL,
    ArrDel15                        Int32 NOT NULL,
    ArrivalDelayGroups              Int32 NOT NULL,
    ArrTimeBlk                      String NOT NULL,
    Cancelled                       UInt8 NOT NULL,
    CancellationCode                String NOT NULL,
    Diverted                        UInt8 NOT NULL,
    CRSElapsedTime                  Int32 NOT NULL,
    ActualElapsedTime               Int32 NOT NULL,
    AirTime                         Int32 NOT NULL,
    Flights                         Int32 NOT NULL,
    Distance                        Int32 NOT NULL,
    DistanceGroup                   UInt8 NOT NULL,
    CarrierDelay                    Int32 NOT NULL,
    WeatherDelay                    Int32 NOT NULL,
    NASDelay                        Int32 NOT NULL,
    SecurityDelay                   Int32 NOT NULL,
    LateAircraftDelay               Int32 NOT NULL,
    FirstDepTime                    String NOT NULL,
    TotalAddGTime                   String NOT NULL,
    LongestAddGTime                 String NOT NULL,
    DivAirportLandings              String NOT NULL,
    DivReachedDest                  String NOT NULL,
    DivActualElapsedTime            String NOT NULL,
    DivArrDelay                     String NOT NULL,
    DivDistance                     String NOT NULL,
    Div1Airport                     String NOT NULL,
    Div1AirportID                   Int32 NOT NULL,
    Div1AirportSeqID                Int32 NOT NULL,
    Div1WheelsOn                    String NOT NULL,
    Div1TotalGTime                  String NOT NULL,
    Div1LongestGTime                String NOT NULL,
    Div1WheelsOff                   String NOT NULL,
    Div1TailNum                     String NOT NULL,
    Div2Airport                     String NOT NULL,
    Div2AirportID                   Int32 NOT NULL,
    Div2AirportSeqID                Int32 NOT NULL,
    Div2WheelsOn                    String NOT NULL,
    Div2TotalGTime                  String NOT NULL,
    Div2LongestGTime                String NOT NULL,
    Div2WheelsOff                   String NOT NULL,
    Div2TailNum                     String NOT NULL,
    Div3Airport                     String NOT NULL,
    Div3AirportID                   Int32 NOT NULL,
    Div3AirportSeqID                Int32 NOT NULL,
    Div3WheelsOn                    String NOT NULL,
    Div3TotalGTime                  String NOT NULL,
    Div3LongestGTime                String NOT NULL,
    Div3WheelsOff                   String NOT NULL,
    Div3TailNum                     String NOT NULL,
    Div4Airport                     String NOT NULL,
    Div4AirportID                   Int32 NOT NULL,
    Div4AirportSeqID                Int32 NOT NULL,
    Div4WheelsOn                    String NOT NULL,
    Div4TotalGTime                  String NOT NULL,
    Div4LongestGTime                String NOT NULL,
    Div4WheelsOff                   String NOT NULL,
    Div4TailNum                     String NOT NULL,
    Div5Airport                     String NOT NULL,
    Div5AirportID                   Int32 NOT NULL,
    Div5AirportSeqID                Int32 NOT NULL,
    Div5WheelsOn                    String NOT NULL,
    Div5TotalGTime                  String NOT NULL,
    Div5LongestGTime                String NOT NULL,
    Div5WheelsOff                   String NOT NULL,
    Div5TailNum                     String NOT NULL

加载表构造:
cat create_ontime.sql | mysql -h127.0.0.1 -P3307 -uroot

数据加载
cat load_ontime.sh

echo "unzip ontime ,input your ontime zip dir: ./load_ontime.sh zip_dir"

ls $1/*.zip |xargs -I{} -P 4 bash -c "echo {}; unzip -q {}'*.csv'-d ./dataset"

if [$? -eq  0];
then
    echo "unzip success"
else
    echo "unzip was wrong!!!"
    exit 1
fi

cat create_ontime.sql |mysql -h127.0.0.1 -P3307 -uroot
if [$? -eq  0];
then
    echo "Ontime table create success"
else
    echo "Ontime table create was wrong!!!"
    exit 1
fi


time ls ./dataset/*.csv|xargs -P 8 -I{} curl -H "insert_sql:insert into ontime format CSV" -H "skip_header:1" -F "upload=@{}" -XPUT http://localhost:8081/v1/streaming_load

应用办法

./load_ontime.sh ZIP 文件目录

基于 Ontime 测试 SQL 展现

Q1 查问 2000 年到 2008 年每天的总的航班总
(0.494 sec., 143.75 million rows/sec., 431.25 MB/sec)

mysql> SELECT DayOfWeek, count(*) AS c FROM ontime WHERE Year >= 2000 AND Year <= 2008 GROUP BY DayOfWeek ORDER BY c DESC;
+-----------+---------+
| DayOfWeek | c       |
+-----------+---------+
|         5 | 8732422 |
|         1 | 8730614 |
|         4 | 8710843 |
|         3 | 8685626 |
|         2 | 8639632 |
|         7 | 8274367 |
|         6 | 7514194 |
+-----------+---------+
7 rows in set (0.50 sec)
Read 71000000 rows, 213 MB in 0.494 sec., 143.75 million rows/sec., 431.25 MB/sec.
mysql> explain SELECT DayOfWeek, count(*) AS c FROM ontime WHERE Year >= 2000 AND Year <= 2008 GROUP BY DayOfWeek ORDER BY c DESC;
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| explain                                                                                                                                                                                                                                                           |
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Projection: DayOfWeek:UInt8, count() as c:UInt64                                                                                                                                                                                                                  |
|   Sort: count():UInt64                                                                                                                                                                                                                                            |
|     AggregatorFinal: groupBy=[[DayOfWeek]], aggr=[[count()]]                                                                                                                                                                                                      |
|       AggregatorPartial: groupBy=[[DayOfWeek]], aggr=[[count()]]                                                                                                                                                                                                  |
|         Filter: ((Year >= 2000) and (Year <= 2008))                                                                                                                                                                                                               |
|           ReadDataSource: scan schema: [Year:UInt16, DayOfWeek:UInt8], statistics: [read_rows: 71000000, read_bytes: 213000000, partitions_scanned: 71, partitions_total: 207], push_downs: [projections: [0, 4], filters: [((Year >= 2000) AND (Year <= 2008))]] |
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
6 rows in set (0.01 sec)

Q2 查问 2000 年到 2008 年提早超过 10 分钟,每天总的提早产生状况
(0.543 sec., 130.71 million rows/sec., 914.95 GB/sec.)

mysql> SELECT DayOfWeek, count(*) AS c FROM ontime WHERE DepDelay>10 AND Year >= 2000 AND Year <= 2008 GROUP BY DayOfWeek ORDER BY c DESC;
+-----------+---------+
| DayOfWeek | c       |
+-----------+---------+
|         5 | 2175733 |
|         4 | 2012848 |
|         1 | 1898879 |
|         7 | 1880896 |
|         3 | 1757508 |
|         2 | 1665303 |
|         6 | 1510894 |
+-----------+---------+
7 rows in set (0.54 sec)
Read 71000000 rows, 497 MB in 0.543 sec., 130.71 million rows/sec., 914.95 MB/sec.
mysql> explain SELECT DayOfWeek, count(*) AS c FROM ontime WHERE DepDelay>10 AND Year >= 2000 AND Year <= 2008 GROUP BY DayOfWeek ORDER BY c DESC;
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| explain                                                                                                                                                                                                                                                                                                     |
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Projection: DayOfWeek:UInt8, count() as c:UInt64                                                                                                                                                                                                                                                            |
|   Sort: count():UInt64                                                                                                                                                                                                                                                                                      |
|     AggregatorFinal: groupBy=[[DayOfWeek]], aggr=[[count()]]                                                                                                                                                                                                                                                |
|       AggregatorPartial: groupBy=[[DayOfWeek]], aggr=[[count()]]                                                                                                                                                                                                                                            |
|         Filter: (((DepDelay > 10) and (Year >= 2000)) and (Year <= 2008))                                                                                                                                                                                                                                   |
|           ReadDataSource: scan schema: [Year:UInt16, DayOfWeek:UInt8, DepDelay:Int32], statistics: [read_rows: 71000000, read_bytes: 497000000, partitions_scanned: 71, partitions_total: 207], push_downs: [projections: [0, 4, 31], filters: [(((DepDelay > 10) AND (Year >= 2000)) AND (Year <= 2008))]] |
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
6 rows in set (0.01 sec)
Read 0 rows, 0 B in 0.002 sec., 0 rows/sec., 0 B/sec.

Q3 2000-2008 年机场的延误次数,显示最高的 10 条
(0.679 sec., 104.59 million rows/sec., 1.78 GB/sec.)

mysql> SELECT Origin, count(*) AS c FROM ontime WHERE DepDelay>10 AND Year >= 2000 AND Year <= 2008 GROUP BY Origin ORDER BY c DESC LIMIT 10;
+--------+--------+
| Origin | c      |
+--------+--------+
| ORD    | 860911 |
| ATL    | 831822 |
| DFW    | 614403 |
| LAX    | 402671 |
| PHX    | 400475 |
| LAS    | 362026 |
| DEN    | 352893 |
| EWR    | 302267 |
| DTW    | 296832 |
| IAH    | 290729 |
+--------+--------+
10 rows in set (0.69 sec)
Read 71000000 rows, 1.21 GB in 0.679 sec., 104.59 million rows/sec., 1.78 GB/sec.
mysql> explain SELECT Origin, count(*) AS c FROM ontime WHERE DepDelay>10 AND Year >= 2000 AND Year <= 2008 GROUP BY Origin ORDER BY c DESC LIMIT 10;
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| explain                                                                                                                                                                                                                                                                                                       |
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Limit: 10                                                                                                                                                                                                                                                                                                     |
|   Projection: Origin:String, count() as c:UInt64                                                                                                                                                                                                                                                              |
|     Sort: count():UInt64                                                                                                                                                                                                                                                                                      |
|       AggregatorFinal: groupBy=[[Origin]], aggr=[[count()]]                                                                                                                                                                                                                                                   |
|         AggregatorPartial: groupBy=[[Origin]], aggr=[[count()]]                                                                                                                                                                                                                                               |
|           Filter: (((DepDelay > 10) and (Year >= 2000)) and (Year <= 2008))                                                                                                                                                                                                                                   |
|             ReadDataSource: scan schema: [Year:UInt16, Origin:String, DepDelay:Int32], statistics: [read_rows: 71000000, read_bytes: 1271665856, partitions_scanned: 71, partitions_total: 207], push_downs: [projections: [0, 14, 31], filters: [(((DepDelay > 10) AND (Year >= 2000)) AND (Year <= 2008))]] |
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
7 rows in set (0.00 sec)
Read 0 rows, 0 B in 0.002 sec., 0 rows/sec., 0 B/sec.

Q4 2007 年各航空公司延误的次数
(0.188 sec., 79.77 million rows/sec., 1.28 GB/sec.)

mysql> SELECT IATA_CODE_Reporting_Airline AS Carrier, count() FROM ontime WHERE DepDelay>10 AND Year = 2007 GROUP BY Carrier ORDER BY count() DESC;
+---------+---------+
| Carrier | count() |
+---------+---------+
| WN      |  296451 |
| AA      |  179769 |
| MQ      |  152293 |
| OO      |  147019 |
| US      |  140199 |
| UA      |  135061 |
| XE      |  108571 |
| EV      |  104055 |
| NW      |  102206 |
| DL      |   98427 |
| CO      |   81039 |
| YV      |   79553 |
| FL      |   64583 |
| OH      |   60532 |
| AS      |   54326 |
| B6      |   53716 |
| 9E      |   48578 |
| F9      |   24100 |
| AQ      |    6764 |
| HA      |    4059 |
+---------+---------+
20 rows in set (0.19 sec)
Read 15000000 rows, 240 MB in 0.188 sec., 79.77 million rows/sec., 1.28 GB/sec.
mysql> explain SELECT IATA_CODE_Reporting_Airline AS Carrier, count() FROM ontime WHERE DepDelay>10 AND Year = 2007 GROUP BY Carrier ORDER BY count() DESC;
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| explain                                                                                                                                                                                                                                                                                                  |
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Projection: IATA_CODE_Reporting_Airline as Carrier:String, count():UInt64                                                                                                                                                                                                                                |
|   Sort: count():UInt64                                                                                                                                                                                                                                                                                   |
|     AggregatorFinal: groupBy=[[IATA_CODE_Reporting_Airline]], aggr=[[count()]]                                                                                                                                                                                                                           |
|       AggregatorPartial: groupBy=[[IATA_CODE_Reporting_Airline]], aggr=[[count()]]                                                                                                                                                                                                                       |
|         Filter: ((DepDelay > 10) and (Year = 2007))                                                                                                                                                                                                                                                      |
|           ReadDataSource: scan schema: [Year:UInt16, IATA_CODE_Reporting_Airline:String, DepDelay:Int32], statistics: [read_rows: 15000000, read_bytes: 250239306, partitions_scanned: 15, partitions_total: 207], push_downs: [projections: [0, 8, 31], filters: [((DepDelay > 10) AND (Year = 2007))]] |
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
6 rows in set (0.00 sec)
Read 0 rows, 0 B in 0.002 sec., 0 rows/sec., 0 B/sec.

Q5 2007 年各航空公司延误的千分比
(0.265 sec., 56.58 million rows/sec., 905.28 MB/sec.)

mysql> SELECT IATA_CODE_Reporting_Airline AS Carrier, avg(cast(DepDelay>10 as Int8))*1000 AS c3 FROM ontime WHERE Year=2007 GROUP BY Carrier ORDER BY c3 DESC;
+---------+--------------------+
| Carrier | c3                 |
+---------+--------------------+
| EV      | 363.53123668047823 |
| AS      |  339.1453631738303 |
| US      |  288.8039271022377 |
| AA      |  283.6112877194699 |
| MQ      |  281.7663100792978 |
| B6      |  280.5745625489684 |
| UA      | 275.63356884257615 |
| YV      | 270.25567158804466 |
| OH      |  256.4567516268981 |
| WN      | 253.62165713752844 |
| CO      | 250.77750030171651 |
| XE      | 249.71881878589517 |
| NW      | 246.56113247419944 |
| F9      | 246.52209492635023 |
| OO      | 245.90051515354253 |
| FL      |  245.4143692596491 |
| DL      | 206.82764258051773 |
| 9E      | 187.66780889391967 |
| AQ      |  145.9016393442623 |
| HA      |  72.25634178905207 |
+---------+--------------------+
20 rows in set (0.27 sec)
Read 15000000 rows, 240 MB in 0.265 sec., 56.58 million rows/sec., 905.28 MB/sec.
mysql> explain SELECT IATA_CODE_Reporting_Airline AS Carrier, avg(cast(DepDelay>10 as Int8))*1000 AS c3 FROM ontime WHERE Year=2007 GROUP BY Carrier ORDER BY c3 DESC;
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| explain                                                                                                                                                                                                                                                                                |
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Projection: IATA_CODE_Reporting_Airline as Carrier:String, (avg(cast((DepDelay > 10) as Int8)) * 1000) as c3:Float64                                                                                                                                                                   |
|   Sort: (avg(cast((DepDelay > 10) as Int8)) * 1000):Float64                                                                                                                                                                                                                            |
|     Expression: IATA_CODE_Reporting_Airline:String, (avg(cast((DepDelay > 10) as Int8)) * 1000):Float64 (Before OrderBy)                                                                                                                                                               |
|       AggregatorFinal: groupBy=[[IATA_CODE_Reporting_Airline]], aggr=[[avg(cast((DepDelay > 10) as Int8))]]                                                                                                                                                                            |
|         AggregatorPartial: groupBy=[[IATA_CODE_Reporting_Airline]], aggr=[[avg(cast((DepDelay > 10) as Int8))]]                                                                                                                                                                        |
|           Expression: IATA_CODE_Reporting_Airline:String, cast((DepDelay > 10) as Int8):Int8 (Before GroupBy)                                                                                                                                                                          |
|             Filter: (Year = 2007)                                                                                                                                                                                                                                                      |
|               ReadDataSource: scan schema: [Year:UInt16, IATA_CODE_Reporting_Airline:String, DepDelay:Int32], statistics: [read_rows: 15000000, read_bytes: 250239306, partitions_scanned: 15, partitions_total: 207], push_downs: [projections: [0, 8, 31], filters: [(Year = 2007)]] |
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
8 rows in set (0.00 sec)
Read 0 rows, 0 B in 0.002 sec., 0 rows/sec., 0 B/sec.

Q6 2000-2008 年各航空公司延误的千分比
(0.935 sec., 75.95 million rows/sec., 1.22 GB/sec.)

mysql> SELECT IATA_CODE_Reporting_Airline AS Carrier, avg(cast(DepDelay>10 as Int8))*1000 AS c3 FROM ontime WHERE Year>=2000 AND Year <=2008 GROUP BY Carrier ORDER BY c3 DESC;
+---------+--------------------+
| Carrier | c3                 |
+---------+--------------------+
| AS      | 293.05649076611434 |
| EV      |  282.0709981074399 |
| YV      |  270.3897636688929 |
| B6      | 257.40594891667007 |
| FL      | 249.28742951361826 |
| XE      | 246.59005902424192 |
| MQ      |  245.3695989400477 |
| WN      | 233.38127235928863 |
| DH      | 227.11013827345042 |
| F9      | 226.08455653226812 |
| UA      | 224.42824657703645 |
| OH      | 215.52882835147614 |
| AA      | 211.97122176454556 |
| US      | 206.60330294168244 |
| HP      | 205.31690167066455 |
| OO      |  202.4243177198239 |
| NW      |  191.7393936377831 |
| TW      |  188.6912623180138 |
| DL      | 187.84162871590732 |
| CO      | 187.71301306878976 |
| 9E      |  181.6396991511518 |
| RU      | 181.46244295416398 |
| TZ      |  176.8928125899626 |
| AQ      | 145.65911608293766 |
| HA      |  79.38672451825789 |
+---------+--------------------+
25 rows in set (0.94 sec)
Read 71000000 rows, 1.14 GB in 0.935 sec., 75.95 million rows/sec., 1.22 GB/sec.
mysql> explain SELECT IATA_CODE_Reporting_Airline AS Carrier, avg(cast(DepDelay>10 as Int8))*1000 AS c3 FROM ontime WHERE Year>=2000 AND Year <=2008 GROUP BY Carrier ORDER BY c3 DESC;
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| explain                                                                                                                                                                                                                                                                                                       |
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Projection: IATA_CODE_Reporting_Airline as Carrier:String, (avg(cast((DepDelay > 10) as Int8)) * 1000) as c3:Float64                                                                                                                                                                                          |
|   Sort: (avg(cast((DepDelay > 10) as Int8)) * 1000):Float64                                                                                                                                                                                                                                                   |
|     Expression: IATA_CODE_Reporting_Airline:String, (avg(cast((DepDelay > 10) as Int8)) * 1000):Float64 (Before OrderBy)                                                                                                                                                                                      |
|       AggregatorFinal: groupBy=[[IATA_CODE_Reporting_Airline]], aggr=[[avg(cast((DepDelay > 10) as Int8))]]                                                                                                                                                                                                   |
|         AggregatorPartial: groupBy=[[IATA_CODE_Reporting_Airline]], aggr=[[avg(cast((DepDelay > 10) as Int8))]]                                                                                                                                                                                               |
|           Expression: IATA_CODE_Reporting_Airline:String, cast((DepDelay > 10) as Int8):Int8 (Before GroupBy)                                                                                                                                                                                                 |
|             Filter: ((Year >= 2000) and (Year <= 2008))                                                                                                                                                                                                                                                       |
|               ReadDataSource: scan schema: [Year:UInt16, IATA_CODE_Reporting_Airline:String, DepDelay:Int32], statistics: [read_rows: 71000000, read_bytes: 1179110760, partitions_scanned: 71, partitions_total: 207], push_downs: [projections: [0, 8, 31], filters: [((Year >= 2000) AND (Year <= 2008))]] |
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
8 rows in set (0.01 sec)
Read 0 rows, 0 B in 0.003 sec., 0 rows/sec., 0 B/sec.

Q7 2000-2008 年各航空公司均匀延误工夫
(0.935 sec., 75.95 million rows/sec., 1.22 GB/sec.)

mysql> SELECT IATA_CODE_Reporting_Airline AS Carrier, avg(DepDelay) * 1000 AS c3 FROM ontime WHERE Year >= 2000 AND Year <= 2008 GROUP BY Carrier;
+---------+--------------------+
| Carrier | c3                 |
+---------+--------------------+
| B6      | 16789.739456036365 |
| NW      | 11717.623092632819 |
| F9      | 11232.889558936127 |
| XE      | 17092.548853057146 |
| YV      |  17971.53933699898 |
| US      |   11868.7097884053 |
| RU      | 12556.249210602802 |
| AS      | 14735.545887755581 |
| HA      |  6851.555976883671 |
| OH      | 12655.103820799075 |
| UA      | 14594.243159716054 |
| TZ      | 12618.760195758565 |
| EV      | 16374.703330010156 |
| HP      | 11625.682112859839 |
| DH      | 15311.949983190174 |
| DL      | 10943.456441165357 |
| 9E      | 13091.087573576122 |
| FL      | 15192.451732538268 |
| MQ      | 14125.201554023559 |
| AQ      |  7323.278123603293 |
| OO      | 11600.594852741107 |
| AA      |  13508.78515494305 |
| TW      | 10842.722114986364 |
| WN      | 10484.932610056378 |
| CO      | 12671.595978518368 |
+---------+--------------------+
25 rows in set (0.74 sec)
Read 71000000 rows, 1.14 GB in 0.727 sec., 97.6 million rows/sec., 1.56 GB/sec.
mysql> explain SELECT IATA_CODE_Reporting_Airline AS Carrier, avg(DepDelay) * 1000 AS c3 FROM ontime WHERE Year >= 2000 AND Year <= 2008 GROUP BY Carrier;
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| explain                                                                                                                                                                                                                                                                                                   |
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Projection: IATA_CODE_Reporting_Airline as Carrier:String, (avg(DepDelay) * 1000) as c3:Float64                                                                                                                                                                                                           |
|   Expression: IATA_CODE_Reporting_Airline:String, (avg(DepDelay) * 1000):Float64 (Before Projection)                                                                                                                                                                                                      |
|     AggregatorFinal: groupBy=[[IATA_CODE_Reporting_Airline]], aggr=[[avg(DepDelay)]]                                                                                                                                                                                                                      |
|       AggregatorPartial: groupBy=[[IATA_CODE_Reporting_Airline]], aggr=[[avg(DepDelay)]]                                                                                                                                                                                                                  |
|         Filter: ((Year >= 2000) and (Year <= 2008))                                                                                                                                                                                                                                                       |
|           ReadDataSource: scan schema: [Year:UInt16, IATA_CODE_Reporting_Airline:String, DepDelay:Int32], statistics: [read_rows: 71000000, read_bytes: 1179110760, partitions_scanned: 71, partitions_total: 207], push_downs: [projections: [0, 8, 31], filters: [((Year >= 2000) AND (Year <= 2008))]] |
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
6 rows in set (0.01 sec)
Read 0 rows, 0 B in 0.002 sec., 0 rows/sec., 0 B/sec.

Q8 每年航班延误均匀工夫
(0.935 sec., 75.95 million rows/sec., 1.22 GB/sec.)

mysql> SELECT Year, avg(DepDelay) FROM ontime GROUP BY Year;
+------+--------------------+
| Year | avg(DepDelay)      |
+------+--------------------+
| 1987 | 12.380385692195556 |
| 1988 |  7.345867511864449 |
| 1989 |   8.81845473300008 |
| 1990 |  7.966702606180775 |
| 1991 |  6.940411174086677 |
| 1992 |  6.687364706154975 |
| 1993 |  7.207721091071671 |
| 1994 |  7.758752042452116 |
| 1995 |  9.328649903752932 |
| 1996 |  11.14468468976826 |
| 1997 |  9.919225483813925 |
| 1998 | 10.884314711941435 |
| 1999 | 11.567390524113748 |
| 2000 | 13.456897681824556 |
| 2001 | 10.895474364001354 |
| 2002 |   9.97856700710386 |
| 2003 |  9.778465263372038 |
| 2004 | 11.936799840656898 |
| 2005 |  12.60167890747495 |
| 2006 | 14.237297887039372 |
| 2007 | 15.431738868356579 |
| 2008 | 14.654588068064287 |
| 2009 | 13.168984006133062 |
| 2010 | 13.202976628175891 |
| 2011 | 13.496191548097778 |
| 2012 | 13.155971481255131 |
| 2013 | 14.901210490900201 |
| 2014 | 15.513697266113969 |
| 2015 | 14.638336410280733 |
| 2016 | 14.643883269504837 |
| 2017 |  15.70225324299191 |
| 2018 |  16.16188254545747 |
| 2019 | 16.983263489524507 |
| 2020 | 10.624498278073712 |
| 2021 | 15.289615417399649 |
+------+--------------------+
35 rows in set (1.04 sec)
Read 201816232 rows, 1.21 GB in 1.030 sec., 195.93 million rows/sec., 1.18 GB/sec.
mysql> explain SELECT Year, avg(DepDelay) FROM ontime GROUP BY Year;
+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| explain                                                                                                                                                                                                          |
+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Projection: Year:UInt16, avg(DepDelay):Float64                                                                                                                                                                   |
|   AggregatorFinal: groupBy=[[Year]], aggr=[[avg(DepDelay)]]                                                                                                                                                      |
|     AggregatorPartial: groupBy=[[Year]], aggr=[[avg(DepDelay)]]                                                                                                                                                  |
|       ReadDataSource: scan schema: [Year:UInt16, DepDelay:Int32], statistics: [read_rows: 201816232, read_bytes: 1210897392, partitions_scanned: 207, partitions_total: 207], push_downs: [projections: [0, 31]] |
+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
4 rows in set (0.01 sec)
Read 0 rows, 0 B in 0.002 sec., 0 rows/sec., 0 B/sec.

Q9 每年有多少航班
(0.509 sec., 396.54 million rows/sec., 793.08 MB/sec.)

mysql> SELECT Year, count(*) as c1 FROM ontime GROUP BY Year;
+------+---------+
| Year | c1      |
+------+---------+
| 1987 |  440403 |
| 1988 | 5202096 |
| 1989 | 5041200 |
| 1990 | 5270893 |
| 1991 | 5076925 |
| 1992 | 5092157 |
| 1993 | 5070501 |
| 1994 | 5180048 |
| 1995 | 5327435 |
| 1996 | 5351983 |
| 1997 | 5411843 |
| 1998 | 5384721 |
| 1999 | 5527884 |
| 2000 | 5683047 |
| 2001 | 5967780 |
| 2002 | 5271359 |
| 2003 | 6488540 |
| 2004 | 7129270 |
| 2005 | 7140596 |
| 2006 | 7141922 |
| 2007 | 7455458 |
| 2008 | 7009726 |
| 2009 | 6450285 |
| 2010 | 6450117 |
| 2011 | 6085281 |
| 2012 | 6096762 |
| 2013 | 6369482 |
| 2014 | 5819811 |
| 2015 | 5819079 |
| 2016 | 5617658 |
| 2017 | 5674621 |
| 2018 | 7213446 |
| 2019 | 7422037 |
| 2020 | 4688354 |
| 2021 | 5443512 |
+------+---------+
35 rows in set (0.52 sec)
Read 201816232 rows, 403.63 MB in 0.509 sec., 396.54 million rows/sec., 793.08 MB/sec.
mysql> explain SELECT Year, count(*) as c1 FROM ontime GROUP BY Year;
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| explain                                                                                                                                                                                     |
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Projection: Year:UInt16, count() as c1:UInt64                                                                                                                                               |
|   AggregatorFinal: groupBy=[[Year]], aggr=[[count()]]                                                                                                                                       |
|     AggregatorPartial: groupBy=[[Year]], aggr=[[count()]]                                                                                                                                   |
|       ReadDataSource: scan schema: [Year:UInt16], statistics: [read_rows: 201816232, read_bytes: 403632464, partitions_scanned: 207, partitions_total: 207], push_downs: [projections: [0]] |
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
4 rows in set (0.01 sec)
Read 0 rows, 0 B in 0.002 sec., 0 rows/sec., 0 B/sec.

Q10 计算每月提早 15 分钟的航班平均数
(0.891 sec., 226.44 million rows/sec., 1.59 GB/sec.)

mysql> SELECT avg(cnt) FROM (SELECT Year,Month,count(*) AS cnt FROM ontime WHERE DepDel15=1 GROUP BY Year,Month) a;
+-------------------+
| avg(cnt)          |
+-------------------+
| 81474.99019607843 |
+-------------------+
1 row in set (0.90 sec)
Read 201816232 rows, 1.41 GB in 0.891 sec., 226.44 million rows/sec., 1.59 GB/sec.
mysql> explain SELECT avg(cnt) FROM (SELECT Year,Month,count(*) AS cnt FROM ontime WHERE DepDel15=1 GROUP BY Year,Month) a;
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| explain                                                                                                                                                                                                                                                             |
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Projection: avg(cnt):Float64                                                                                                                                                                                                                                        |
|   AggregatorFinal: groupBy=[[]], aggr=[[avg(cnt)]]                                                                                                                                                                                                                  |
|     AggregatorPartial: groupBy=[[]], aggr=[[avg(cnt)]]                                                                                                                                                                                                              |
|       Projection: Year:UInt16, Month:UInt8, count() as cnt:UInt64                                                                                                                                                                                                   |
|         AggregatorFinal: groupBy=[[Year, Month]], aggr=[[count()]]                                                                                                                                                                                                  |
|           AggregatorPartial: groupBy=[[Year, Month]], aggr=[[count()]]                                                                                                                                                                                              |
|             Filter: (DepDel15 = 1)                                                                                                                                                                                                                                  |
|               ReadDataSource: scan schema: [Year:UInt16, Month:UInt8, DepDel15:Int32], statistics: [read_rows: 201816232, read_bytes: 1412713624, partitions_scanned: 207, partitions_total: 207], push_downs: [projections: [0, 2, 33], filters: [(DepDel15 = 1)]] |
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
8 rows in set (0.01 sec)
Read 0 rows, 0 B in 0.002 sec., 0 rows/sec., 0 B/sec.

Q11 计算每月航班平均数
(0.561 sec., 359.58 million rows/sec., 1.08 GB/sec.)

mysql> SELECT avg(c1) FROM (SELECT Year,Month,count(*) AS c1 FROM ontime GROUP BY Year,Month) a;
+-------------------+
| avg(c1)           |
+-------------------+
| 494647.6274509804 |
+-------------------+
1 row in set (0.57 sec)
Read 201816232 rows, 605.45 MB in 0.561 sec., 359.58 million rows/sec., 1.08 GB/sec.
mysql> explain SELECT avg(c1) FROM (SELECT Year,Month,count(*) AS c1 FROM ontime GROUP BY Year,Month) a;
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| explain                                                                                                                                                                                                           |
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Projection: avg(c1):Float64                                                                                                                                                                                       |
|   AggregatorFinal: groupBy=[[]], aggr=[[avg(c1)]]                                                                                                                                                                 |
|     AggregatorPartial: groupBy=[[]], aggr=[[avg(c1)]]                                                                                                                                                             |
|       Projection: Year:UInt16, Month:UInt8, count() as c1:UInt64                                                                                                                                                  |
|         AggregatorFinal: groupBy=[[Year, Month]], aggr=[[count()]]                                                                                                                                                |
|           AggregatorPartial: groupBy=[[Year, Month]], aggr=[[count()]]                                                                                                                                            |
|             ReadDataSource: scan schema: [Year:UInt16, Month:UInt8], statistics: [read_rows: 201816232, read_bytes: 605448696, partitions_scanned: 207, partitions_total: 207], push_downs: [projections: [0, 2]] |
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
7 rows in set (0.02 sec)
Read 0 rows, 0 B in 0.002 sec., 0 rows/sec., 0 B/sec.

Q12 显示 10 个两个城市直飞线航班最多的前 10 个
(2.930 sec., 68.87 million rows/sec., 2.91 GB/sec.)

mysql> SELECT OriginCityName, DestCityName, count(*) AS c FROM ontime GROUP BY OriginCityName, DestCityName ORDER BY c DESC LIMIT 10;
+-------------------+-------------------+--------+
| OriginCityName    | DestCityName      | c      |
+-------------------+-------------------+--------+
| San Francisco, CA | Los Angeles, CA   | 514878 |
| Los Angeles, CA   | San Francisco, CA | 512147 |
| New York, NY      | Chicago, IL       | 456042 |
| Chicago, IL       | New York, NY      | 448756 |
| Chicago, IL       | Minneapolis, MN   | 437913 |
| Minneapolis, MN   | Chicago, IL       | 433688 |
| Los Angeles, CA   | Las Vegas, NV     | 428942 |
| Las Vegas, NV     | Los Angeles, CA   | 422825 |
| New York, NY      | Boston, MA        | 419405 |
| Boston, MA        | New York, NY      | 416324 |
+-------------------+-------------------+--------+
10 rows in set (2.94 sec)
Read 201816232 rows, 8.54 GB in 2.930 sec., 68.87 million rows/sec., 2.91 GB/sec.
mysql> explain SELECT OriginCityName, DestCityName, count(*) AS c FROM ontime GROUP BY OriginCityName, DestCityName ORDER BY c DESC LIMIT 10;
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| explain                                                                                                                                                                                                                              |
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Limit: 10                                                                                                                                                                                                                            |
|   Projection: OriginCityName:String, DestCityName:String, count() as c:UInt64                                                                                                                                                        |
|     Sort: count():UInt64                                                                                                                                                                                                             |
|       AggregatorFinal: groupBy=[[OriginCityName, DestCityName]], aggr=[[count()]]                                                                                                                                                    |
|         AggregatorPartial: groupBy=[[OriginCityName, DestCityName]], aggr=[[count()]]                                                                                                                                                |
|           ReadDataSource: scan schema: [OriginCityName:String, DestCityName:String], statistics: [read_rows: 201816232, read_bytes: 9829664815, partitions_scanned: 207, partitions_total: 207], push_downs: [projections: [15, 24]] |
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
6 rows in set (0.00 sec)
Read 0 rows, 0 B in 0.002 sec., 0 rows/sec., 0 B/sec.

Q13 显示飞机最多航班的 10 个城市
(1.223 sec., 165.05 million rows/sec., 3.49 GB/sec.)

mysql> SELECT OriginCityName, count(*) AS c FROM ontime GROUP BY OriginCityName ORDER BY c DESC LIMIT 10;
+-----------------------+----------+
| OriginCityName        | c        |
+-----------------------+----------+
| Chicago, IL           | 12545243 |
| Atlanta, GA           | 10900284 |
| Dallas/Fort Worth, TX |  9011081 |
| Houston, TX           |  6844476 |
| Los Angeles, CA       |  6695628 |
| New York, NY          |  6309911 |
| Denver, CO            |  6283055 |
| Phoenix, AZ           |  5658884 |
| Washington, DC        |  4998047 |
| San Francisco, CA     |  4673365 |
+-----------------------+----------+
10 rows in set (1.23 sec)
Read 201816232 rows, 4.27 GB in 1.223 sec., 165.05 million rows/sec., 3.49 GB/sec.
mysql> explain SELECT OriginCityName, count(*) AS c FROM ontime GROUP BY OriginCityName ORDER BY c DESC LIMIT 10;
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| explain                                                                                                                                                                                                     |
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Limit: 10                                                                                                                                                                                                   |
|   Projection: OriginCityName:String, count() as c:UInt64                                                                                                                                                    |
|     Sort: count():UInt64                                                                                                                                                                                    |
|       AggregatorFinal: groupBy=[[OriginCityName]], aggr=[[count()]]                                                                                                                                         |
|         AggregatorPartial: groupBy=[[OriginCityName]], aggr=[[count()]]                                                                                                                                     |
|           ReadDataSource: scan schema: [OriginCityName:String], statistics: [read_rows: 201816232, read_bytes: 4914707403, partitions_scanned: 207, partitions_total: 207], push_downs: [projections: [15]] |
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
6 rows in set (0.01 sec)
Read 0 rows, 0 B in 0.002 sec., 0 rows/sec., 0 B/sec.

Q14 查问 ontime 表总共有多少行
(0.002 sec., 443.51 rows/sec., 443.51 B/sec.)

mysql> SELECT count(*) FROM ontime;
+-----------+
| count()   |
+-----------+
| 201816232 |
+-----------+
1 row in set (0.01 sec)
Read 1 rows, 1 B in 0.002 sec., 443.51 rows/sec., 443.51 B/sec.
mysql> explain SELECT count(*) FROM ontime;
+-----------------------------------------------------------------------------------------------------------------------------------------+
| explain                                                                                                                                 |
+-----------------------------------------------------------------------------------------------------------------------------------------+
| Projection: count():UInt64                                                                                                              |
|   Projection: 201816232 as count():UInt64                                                                                               |
|     Expression: 201816232:UInt64 (Exact Statistics)                                                                                     |
|       ReadDataSource: scan schema: [dummy:UInt8], statistics: [read_rows: 1, read_bytes: 1, partitions_scanned: 1, partitions_total: 1] |
+-----------------------------------------------------------------------------------------------------------------------------------------+
4 rows in set (0.01 sec)
Read 0 rows, 0 B in 0.002 sec., 0 rows/sec., 0 B/sec.

更多性能测试

Databend On Amazon S3 Performance
https://databend.rs/doc/perfo…

Databend On Alibaba Cloud ECS OSS Performance
https://databend.rs/doc/perfo…

Databend On Wasabi Performance
https://databend.rs/doc/perfo…

须要反对请增加微信:82565387 获取更多帮忙。

正文完
 0