关于大数据:深入MaxCompute-第十二弹-PIVOTUNPIVOT

55次阅读

共计 12060 个字符,预计需要花费 31 分钟才能阅读完成。

简介:  MaxCompute 推出新语法 – PIVOT/UNPIVOT:通过 PIVOT 关键字基于聚合将一个或者多个指定值的行转换为列;通过 UNPIVOT 关键字可将一个或者多个列转换为行。以更简洁易用的形式满足行转列和列转行的需要,简化了查问语句,进步了宽广大数据开发者的生产力。

MaxCompute(原 ODPS)是阿里云自主研发的具备业界领先水平的分布式大数据处理平台, 尤其在团体外部失去广泛应用,撑持了多个 BU 的外围业务。MaxCompute 除了继续优化性能外,也致力于晋升 SQL 语言的用户体验和表达能力,进步宽广 MaxCompute 开发者的生产力。

MaxCompute 基于 MaxCompute2.0 新一代的 SQL 引擎,显著晋升了 SQL 语言编译过程的易用性与语言的表达能力。咱们在此推出深刻 MaxCompute 系列文章

第一弹 – 善用 MaxCompute 编译器的谬误和正告
第二弹 – 新的根本数据类型与内建函数
第三弹 – 简单类型
第四弹 – CTE,VALUES,SEMIJOIN
第五弹 – SELECT TRANSFORM
第六弹 – User Defined Type
第七弹 – Grouping Set, Cube and Rollup
第八弹 – 动静类型函数
第九弹 – 脚本模式与参数视图

第十弹 – IF ELSE 分支语句

第十一弹 – QUALIFY

本文将向您介绍 MaxCompute 反对的新语法 – PIVOT/UNPIVOT,即通过 PIVOT 关键字基于聚合将一个或者多个指定值的行转换为列;通过 UNPIVOT 关键字可将一个或者多个列转换为行。常见的场景入下:

  • 场景 1
    某个业务表,须要把表中的值当做新的列,并且依据每个值聚合现有的后果,从而实现行转列的成果。在没有反对 PIVOT 前,要实现这个需要,须要联合 GROUP BY 语法 + 聚合函数 +Filter 语法过滤来实现。
  • 场景 2
    某个业务表,须要结构一个新的列,把原有的几个列名合并在这个列外面,并且用另一个新列来搁置原来几个列的值,从而实现列转行的成果。在没有反对 UNPIVOT 前,要实现这个需要,须要联合 CROSS JOIN 语法 +CASE WHEN 表达式来结构实现。

PIVOT/UNPIVOT 性能

PIVOT

PIVOT 概述

PIVOT 语法将指定的行旋转为多列,并且对其余列值聚合失去后果并旋转表。PIVOT 语法是 FROM 子句的一部分。

SELECT ... 
FROM ... 
PIVOT (<aggregate function> [AS <alias>] [, <aggregate function> [AS <alias>]] ... 
    FOR (<column> [, <column>] ...) 
    IN ((<value> [, <value>] ...) AS <new column> 
        [, (<value> [, <value>] ...) AS <new column>] 
        ... 
       ) 
    ) 
[...]
  • <aggregate_function>
    示意行转列时须要计算的聚合函数,且聚合函数的外层不能嵌套任何函数,能够是 Scalar 函数和列组成的表达式。同时聚合函数的参数外部不能有其余聚合函数、Window 函数,以及聚合函数的列只能是上游表中的列。
  • <alias>
    示意行转列时须要计算的聚合函数的对应列的别名。
  • <column>
    示意行转列的对应行的列名,不能是任何的表达式。
  • <value>
    示意行转列的对应行的值,也能够是表达式,然而不容许有任何的聚合函数和窗口函数,并且每一个元组内的元素数量要与 <column> 数量统一。
  • <new_column>
    示意行转列后新的列的别名,不指定别名时,会试图揣测别名,揣测失败会由零碎主动生成一个别名。

更具体的语法应用阐明可参考文档。

PIVOT 语法能够等效为 group by + aggregate function + filter 的联合。以上面这个例子为例

SELECT ...
FROM ...
PIVOT (
 agg1 AS a, agg2 AS b, ...
 FOR (axis1, ..., axisN)
 IN ((v11, ..., v1N) AS label1,
     (v21, ..., v2N) AS label2, 
     ...)
 )

下面的语法等效于

SELECT 
 k1, ... kN, 
 agg1 AS label1_a FILTER (where axis1 = v11 and ... and axisN = v1N), 
 agg2 AS label1_b FILTER (where axis1 = v21 and ... and axisN = v2N), 
 ..., 
 agg1 AS label2_a FILTER (where axis1 = v11 and ... and axisN = v1N),
 agg2 AS label2_b FILTER (where axis1 = v21 and ... and axisN = v2N), 
 ..., 
 FROM xxxxxx
 GROUP BY k1, ... kN

其中 FROM 内的表是 PIVOT 上游的后果,k1, … kN 是所有未在 agg1, agg2, … 和 axis1, …, axisN 呈现的列的汇合。

PVIOT 示例

  • 数据筹备。以下表代表几家连锁店对应物品在对应年份的销售状况。
create table shops_table as select * from (select * from values
('pen', 10, 500, 'shop1', 2020),
('pen', 11, 500, 'shop2', 2020),
('pen', 9, 300, 'shop3', 2020),
('pen', 12, 400,'shop4', 2020),
('pen', 15, 200, 'shop1', 2021),
('pen', 16, 300, 'shop2', 2021),
('pen', 16, 400, 'shop3', 2021),
('pen', 15, 300, 'shop4', 2021),
('ruler', 20, 700, 'shop1', 2020),
('ruler', 19, 900, 'shop2', 2020),
('ruler', 22, 800, 'shop3', 2020),
('ruler', 19, 700, 'shop4', 2020),
('ruler', 25, 300, 'shop1', 2021),
('ruler', 20, 500, 'shop2', 2021),
('ruler', 23, 500, 'shop3', 2021),
('ruler', 26, 600, 'shop4', 2021)
shops(item_name, count, sales, shop_name, year));
select * from shops_table;
-- 后果如下:+-----------+------------+------------+-----------+------------+
| item_name | count      | sales      | shop_name | year       |
+-----------+------------+------------+-----------+------------+
| pen       | 10         | 500        | shop1     | 2020       |
| pen       | 11         | 500        | shop2     | 2020       |
| pen       | 9          | 300        | shop3     | 2020       |
| pen       | 12         | 400        | shop4     | 2020       |
| pen       | 15         | 200        | shop1     | 2021       |
| pen       | 16         | 300        | shop2     | 2021       |
| pen       | 16         | 400        | shop3     | 2021       |
| pen       | 15         | 300        | shop4     | 2021       |
| ruler     | 20         | 700        | shop1     | 2020       |
| ruler     | 19         | 900        | shop2     | 2020       |
| ruler     | 22         | 800        | shop3     | 2020       |
| ruler     | 19         | 700        | shop4     | 2020       |
| ruler     | 25         | 300        | shop1     | 2021       |
| ruler     | 20         | 500        | shop2     | 2021       |
| ruler     | 23         | 500        | shop3     | 2021       |
| ruler     | 26         | 600        | shop4     | 2021       |
+-----------+------------+------------+-----------+------------+
  • 统计各个年份各个店对物品的卖出数量状况。

<!—->

    • 没有反对 PVIOT 语法前,实现如下:
SELECT  item_name
        ,year
        ,SUM(CASE shop_name WHEN 'shop1' THEN count END) AS shop1
        ,SUM(CASE shop_name WHEN 'shop2' THEN count END) AS shop2
        ,SUM(CASE shop_name WHEN 'shop3' THEN count END) AS shop3
        ,SUM(CASE shop_name WHEN 'shop4' THEN count END) AS shop4
FROM    shops_table
GROUP BY item_name
         ,year
;
-- 后果如下:+-----------+------------+------------+------------+------------+------------+
| item_name | year       | 'shop1'    | 'shop2'    | 'shop3'    | 'shop4'    |
+-----------+------------+------------+------------+------------+------------+
| pen       | 2020       | 10         | 11         | 9          | 12         |
| pen       | 2021       | 15         | 16         | 16         | 15         |
| ruler     | 2020       | 20         | 19         | 22         | 19         |
| ruler     | 2021       | 25         | 20         | 23         | 26         |
+-----------+------------+------------+------------+------------+------------+
    • 通过 PVIOT 语法实现如下:
select * from (select item_name, year,count,shop_name from shops_table)
pivot (sum(count) for shop_name in ('shop1', 'shop2', 'shop3', 'shop4'));
-- 后果如下:+------------+------------+------------+------------+------------+------------+
| item_name  | year       | 'shop1'    | 'shop2'    | 'shop3'    | 'shop4'    | 
+------------+------------+------------+------------+------------+------------+
| pen        | 2020       | 10         | 11         | 9          | 12         | 
| pen        | 2021       | 15         | 16         | 16         | 15         | 
| ruler      | 2020       | 20         | 19         | 22         | 19         | 
| ruler      | 2021       | 25         | 20         | 23         | 26         | 
+------------+------------+------------+------------+------------+------------+

能够在此时为聚合函数和新的列起别名,列名依据下划线合并:

select * from (select item_name, count, shop_name, year from shops_table)
pivot (sum(count) as sum_count for shop_name in ('shop1' as shop_name_1, 'shop2' as shop_name_2, 'shop3' as shop_name_3, 'shop4' as shop_name_4));
-- 后果如下:+------------+------------+-----------------------+-----------------------+-----------------------+-----------------------+
| item_name  | year       | shop_name_1_sum_count | shop_name_2_sum_count | shop_name_3_sum_count | shop_name_4_sum_count | 
+------------+------------+-----------------------+-----------------------+-----------------------+-----------------------+
| pen        | 2020       | 10                    | 11                    | 9                     | 12                    | 
| pen        | 2021       | 15                    | 16                    | 16                    | 15                    | 
| ruler      | 2020       | 20                    | 19                    | 22                    | 19                    | 
| ruler      | 2021       | 25                    | 20                    | 23                    | 26                    | 
+------------+------------+-----------------------+-----------------------+-----------------------+-----------------------+
  • 计算每个物品每家商店每年的总卖出数量和最高销售额,通过 PIVOT 实现如下:
select * from shops_table
pivot (sum(count) as sum_count, max(sales) as max_sales for shop_name in ('shop1' as shop_name_1, 'shop2' as shop_name_2, 'shop3' as shop_name_3, 'shop4' as shop_name_4));
-- 后果如下:+-----------+------------+-----------------------+-----------------------+-----------------------+-----------------------+-----------------------+-----------------------+-----------------------+-----------------------+
| item_name | year       | shop_name_1_sum_count | shop_name_2_sum_count | shop_name_3_sum_count | shop_name_4_sum_count | shop_name_1_max_sales | shop_name_2_max_sales | shop_name_3_max_sales | shop_name_4_max_sales |
+-----------+------------+-----------------------+-----------------------+-----------------------+-----------------------+-----------------------+-----------------------+-----------------------+-----------------------+
| pen       | 2020       | 10                    | 11                    | 9                     | 12                    | 500                   | 500                   | 300                   | 400                   |
| pen       | 2021       | 15                    | 16                    | 16                    | 15                    | 200                   | 300                   | 400                   | 300                   |
| ruler     | 2020       | 20                    | 19                    | 22                    | 19                    | 700                   | 900                   | 800                   | 700                   |
| ruler     | 2021       | 25                    | 20                    | 23                    | 26                    | 300                   | 500                   | 500                   | 600                   |
+-----------+------------+-----------------------+-----------------------+-----------------------+-----------------------+-----------------------+-----------------------+-----------------------+-----------------------+
  • 只计算 shop1 在 2020 年和 2021 对于每件物品的总卖出数量和最高销售额,通过 PIVOT 实现如下:
select * from shops_table
pivot (sum(count) as sum_count, max(sales) as max_sales for (shop_name, year) in (('shop1', 2021) as shop1_2021, ('shop1', 2020) as shop1_2020));
-- 后果如下:+-----------+----------------------+----------------------+----------------------+----------------------+
| item_name | shop1_2021_sum_count | shop1_2020_sum_count | shop1_2021_max_sales | shop1_2020_max_sales |
+-----------+----------------------+----------------------+----------------------+----------------------+
| pen       | 15                   | 10                   | 200                  | 500                  |
| ruler     | 25                   | 20                   | 300                  | 700                  |
+-----------+----------------------+----------------------+----------------------+----------------------+

UNPIVOT

UNPIVOT 概述

UNPIVOT 语法通过将列转换为行来旋转表格,UNPIVOT 语法是 FROM 子句的一部分。

SELECT ...
FROM ...
UNPIVOT [EXCLUDE NULLS] (<new_column_of_name> [, <new_column_of_name>] ...
    FOR (<new_column_of_value> [, <new_column_of_value>] ...)
    IN ((<column> [, <column>] ...) AS (<column_value> [, <column_value>] ...)
        [, (<column> [, <column>] ...) AS (<column_value> [, <column_value>] ...)]
        ...
       )
    )
[...]
  • [EXCLUDE NULLS]
    若指定该语法,则会过滤掉所有都是 null 的行。
  • <new_column_of_name>
    列转行当前用于存储原有的列名的列,必须为列名不能是表达式也不能重名。数量须要和每一个 <column value> 元祖外部元素的数量雷同,其中 <column value> 不指定时,MaxCompute 会主动生成一组 string 类型的元祖。
  • <new_column_of_value>
    列转行当前用于存储原有的列对应值的列,必须为列名不能是表达式也不能重名,数量须要和每一个 <column> 元祖外部元素的数量雷同。
  • <column>
    用于列转行的原有的列。
  • <column_value>
    用于列转行的原有的列的别名,能够用于替换原有的列名,外部不容许有列名。

更具体的语法应用阐明可参考文档。

UNPIVOT 语法能够等效为 CROSS JOIN + CASE WHEN 表达式的联合。以上面这个例子为例:

SELECT ...
FROM ...
UNPIVOT [exclude nulls] ((measure1, ..., measureM)
 FOR (axis1, ..., axisN)
 IN ((c11, ..., c1M) AS (value11, ..., value1N),
     (c21, ..., c2M) AS (value21, ..., value2N), ...))
[...]

下面的语法等效于

SELECT  * FROM
(
 SELECT
 k1, ... kN,
 CASE 
 WHEN axis1 = value11 AND ... AND axisN = value1N THEN c11
 WHEN axis1 = value21 AND ... AND axisN = value2N THEN c21
 ...
 ELSE null AS measure1,
 ..., 
 CASE 
 WHEN axis1 = value11 AND ... AND axisN = value1N THEN c1M
 WHEN axis1 = value21 AND ... AND axisN = value2N THEN c2M
 ELSE null AS measureM, 
 axis1, ..., axisN
 FROM xxxx 
 JOIN (VALUES (value11, ..., value1N),(value21, ..., value2N), ... AS generated_table_name(axis1, ..., axisN))
)
[WHERE measure1 is not null OR ... OR measureM is not null]

UNPIVOT 示例

  • 数据筹备。以下表代表几家连锁店对应物品在对应年份的销售状况:
create table shops as select * from(select * from values
('pen', 2020, 100, 200, 300, 400),
('pen', 2021, 100, 200, 200, 100),
('ruler', 2020, 300, 400, 300, 200),
('ruler', 2021, 400, 300, 100, 100)
shops(item_name, year, shop1, shop2, shop3, shop4));
SELECT * from shops;
-- 执行后果:+-----------+------------+------------+------------+------------+------------+
| item_name | year       | shop1      | shop2      | shop3      | shop4      |
+-----------+------------+------------+------------+------------+------------+
| pen       | 2020       | 100        | 200        | 300        | 400        |
| pen       | 2021       | 100        | 200        | 200        | 100        |
| ruler     | 2020       | 300        | 400        | 300        | 200        |
| ruler     | 2021       | 400        | 300        | 100        | 100        |
+-----------+------------+------------+------------+------------+------------+
  • 旋转表,失去各个商店的销售数量,并且用新的列名 count 来代替。

<!—->

    • 没有 UNPIVOT 前的实现形式:
select * from(
select item_name,year, 'shop1' as shop_name, shop1 as count from shops
union ALL 
select item_name,year, 'shop2' as shop_name, shop2 as count from shops
UNION ALL 
select item_name,year, 'shop3' as shop_name, shop3 as count from shops
UNION ALL  
select item_name,year, 'shop4' as shop_name, shop4 as count from shops
);
-- 执行后果
+------------+------------+------------+------------+
| item_name  | year       | shop_name  | count      | 
+------------+------------+------------+------------+
| pen        | 2020       | shop1      | 100        | 
| pen        | 2021       | shop1      | 100        | 
| ruler      | 2020       | shop1      | 300        | 
| ruler      | 2021       | shop1      | 400        | 
| pen        | 2020       | shop2      | 200        | 
| pen        | 2021       | shop2      | 200        | 
| ruler      | 2020       | shop2      | 400        | 
| ruler      | 2021       | shop2      | 300        | 
| pen        | 2020       | shop3      | 300        | 
| pen        | 2021       | shop3      | 200        | 
| ruler      | 2020       | shop3      | 300        | 
| ruler      | 2021       | shop3      | 100        | 
| pen        | 2020       | shop4      | 400        | 
| pen        | 2021       | shop4      | 100        | 
| ruler      | 2020       | shop4      | 200        | 
| ruler      | 2021       | shop4      | 100        | 
+------------+------------+------------+------------+
    • 通过 UNPIVOT 实现:
select * from shops
unpivot (count for shop_name in (shop1, shop2, shop3, shop4));
-- 执行后果
+------------+------------+------------+------------+
| item_name  | year       | count      | shop_name  | 
+------------+------------+------------+------------+
| pen        | 2020       | 100        | shop1      | 
| pen        | 2020       | 200        | shop2      | 
| pen        | 2020       | 300        | shop3      | 
| pen        | 2020       | 400        | shop4      | 
| pen        | 2021       | 100        | shop1      | 
| pen        | 2021       | 200        | shop2      | 
| pen        | 2021       | 200        | shop3      | 
| pen        | 2021       | 100        | shop4      | 
| ruler      | 2020       | 300        | shop1      | 
| ruler      | 2020       | 400        | shop2      | 
| ruler      | 2020       | 300        | shop3      | 
| ruler      | 2020       | 200        | shop4      | 
| ruler      | 2021       | 400        | shop1      | 
| ruler      | 2021       | 300        | shop2      | 
| ruler      | 2021       | 100        | shop3      | 
| ruler      | 2021       | 100        | shop4      | 
+------------+------------+------------+------------+
  • 如果 shop1 和 shop2 是东区商店,shop3 和 shop4 是西区商店,接下来须要一个新的列示意东区商店和西区商店。其中 count1 和 count2 两列别离存储了两店的销售数量。
select * from shops
unpivot ((count1, count2) for shop_name in ((shop1, shop2) as 'east_shop', (shop3, shop4) as 'west_shop'));
-- 执行后果
+------------+------------+------------+------------+------------+
| item_name  | year       | count1     | count2     | shop_name  | 
+------------+------------+------------+------------+------------+
| pen        | 2020       | 100        | 200        | east_shop  | 
| pen        | 2020       | 300        | 400        | west_shop  | 
| pen        | 2021       | 100        | 200        | east_shop  | 
| pen        | 2021       | 200        | 100        | west_shop  | 
| ruler      | 2020       | 300        | 400        | east_shop  | 
| ruler      | 2020       | 300        | 200        | west_shop  | 
| ruler      | 2021       | 400        | 300        | east_shop  | 
| ruler      | 2021       | 100        | 100        | west_shop  | 
+------------+------------+------------+------------+------------+

别名能够是多列,然而对应的须要生成的新的列名要相应减少:

select * from shops
unpivot ((count1, count2) for (shop_name, location) in ((shop1, shop2) as ('east_shop', 'east'), (shop3, shop4) as ('west_shop', 'west')));
-- 执行后果
+------------+------------+------------+------------+------------+------------+
| item_name  | year       | count1     | count2     | shop_name  | location   | 
+------------+------------+------------+------------+------------+------------+
| pen        | 2020       | 100        | 200        | east_shop  | east       | 
| pen        | 2020       | 300        | 400        | west_shop  | west       | 
| pen        | 2021       | 100        | 200        | east_shop  | east       | 
| pen        | 2021       | 200        | 100        | west_shop  | west       | 
| ruler      | 2020       | 300        | 400        | east_shop  | east       | 
| ruler      | 2020       | 300        | 200        | west_shop  | west       | 
| ruler      | 2021       | 400        | 300        | east_shop  | east       | 
| ruler      | 2021       | 100        | 100        | west_shop  | west       | 
+------------+------------+------------+------------+------------+------------+

小结

PIVOT/UNPIVOT 语法,以更简洁易用的形式满足行转列和列转行的需要,简化了查问语句,进步了宽广大数据开发者的生产力。

正文完
 0