乐趣区

关于大数据:最强最全面的大数据SQL面试系列

本套 SQL 题的答案是由许多小伙伴独特奉献的,1+ 1 的力量是远远大于 2 的,有不少题目都采纳了 十分奇妙的解法 ,也有不少题目 有多种解法。本套大数据 SQL 题不仅题目丰盛多样,答案更是精彩绝伦!

注:以下参考答案都通过简略数据场景进行测试通过,但并未测试其余简单状况。本文档的 SQL 次要应用Hive SQL

因内容较多,带目录的 PDF 查看是比拟不便的

最强最全面的大数据 SQL 经典面试题残缺 PDF 版

一、行列转换

形容 :表中记录了各年份各部门的均匀绩效考核问题。
\
表名:t1
\
表构造:

a -- 年份
b -- 部门
c -- 绩效得分

表内容

 a   b  c
2014  B  9
2015  A  8
2014  A  10
2015  B  7

问题一:多行转多列

问题形容:将上述表内容转为如下输入后果所示:

 a  col_A col_B
2014  10   9
2015  8    7

参考答案

select 
    a,
    max(case when b="A" then c end) col_A,
    max(case when b="B" then c end) col_B
from t1
group by a;

问题二:如何将后果转成源表?(多列转多行)

问题形容 :将 问题一 的后果转成源表,问题一后果表名为t1_2

参考答案

select 
    a,
    b,
    c
from (
    select a,"A" as b,col_a as c from t1_2 
    union all 
    select a,"B" as b,col_b as c from t1_2  
)tmp; 

问题三:同一部门会有多个绩效,求多行转多列后果

问题形容:2014 年公司组织架构调整,导致部门呈现多个绩效,业务及人员不同,无奈合并算绩效,源表内容如下:

2014  B  9
2015  A  8
2014  A  10
2015  B  7
2014  B  6

输入后果如下所示

 a    col_A  col_B
2014   10    6,9
2015   8     7

参考答案:

select 
    a,
    max(case when b="A" then c end) col_A,
    max(case when b="B" then c end) col_B
from (
    select 
        a,
        b,
        concat_ws(",",collect_set(cast(c as string))) as c
    from t1
    group by a,b
)tmp
group by a;

二、排名中取他值

表名 t2
\
表字段及内容

a    b   c
2014  A   3
2014  B   1
2014  C   2
2015  A   4
2015  D   3

问题一:按 a 分组取 b 字段最小时对应的 c 字段

输入后果如下所示

a   min_c
2014  3
2015  4

参考答案:

select
  a,
  c as min_c
from
(
      select
        a,
        b,
        c,
        row_number() over(partition by a order by b) as rn 
      from t2 
)a
where rn = 1;

问题二:按 a 分组取 b 字段排第二时对应的 c 字段

输入后果如下所示

 a  second_c
2014  1
2015  3

参考答案

select
  a,
  c as second_c
from
(
      select
        a,
        b,
        c,
        row_number() over(partition by a order by b) as rn 
      from t2 
)a
where rn = 2;

问题三:按 a 分组取 b 字段最小和最大时对应的 c 字段

输入后果如下所示

a    min_c  max_c
2014  3      2
2015  4      3

参考答案:

select
  a,
  min(if(asc_rn = 1, c, null)) as min_c,
  max(if(desc_rn = 1, c, null)) as max_c
from
(
      select
        a,
        b,
        c,
        row_number() over(partition by a order by b) as asc_rn,
        row_number() over(partition by a order by b desc) as desc_rn 
      from t2 
)a
where asc_rn = 1 or desc_rn = 1
group by a; 

问题四:按 a 分组取 b 字段第二小和第二大时对应的 c 字段

输入后果如下所示

a    min_c  max_c
2014  1      1
2015  3      4

参考答案

select
    ret.a
    ,max(case when ret.rn_min = 2 then ret.c else null end) as min_c
    ,max(case when ret.rn_max = 2 then ret.c else null end) as max_c
from (
    select
        *
        ,row_number() over(partition by t2.a order by t2.b) as rn_min
        ,row_number() over(partition by t2.a order by t2.b desc) as rn_max
    from t2
) as ret
where ret.rn_min = 2
or ret.rn_max = 2
group by ret.a;

问题五:按 a 分组取 b 字段前两小和前两大时对应的 c 字段

留神:需放弃 b 字段最小、最大排首位

输入后果如下所示

a    min_c  max_c
2014  3,1     2,1
2015  4,3     3,4

参考答案

select
  tmp1.a as a,
  min_c,
  max_c
from 
(
  select 
    a,
    concat_ws(',', collect_list(c)) as min_c
  from
    (
     select
       a,
       b,
       c,
       row_number() over(partition by a order by b) as asc_rn
     from t2
     )a
    where asc_rn <= 2 
    group by a 
)tmp1 
join 
(
  select 
    a,
    concat_ws(',', collect_list(c)) as max_c
  from
    (
     select
        a,
        b,
        c,
        row_number() over(partition by a order by b desc) as desc_rn 
     from t2
    )a
    where desc_rn <= 2
    group by a 
)tmp2 
on tmp1.a = tmp2.a; 

三、累计求值

表名 t3
\
表字段及内容

a    b   c
2014  A   3
2014  B   1
2014  C   2
2015  A   4
2015  D   3

问题一:按 a 分组按 b 字段排序,对 c 累计求和

输入后果如下所示

a    b   sum_c
2014  A   3
2014  B   4
2014  C   6
2015  A   4
2015  D   7

参考答案

select 
  a, 
  b, 
  c, 
  sum(c) over(partition by a order by b) as sum_c
from t3; 

问题二:按 a 分组按 b 字段排序,对 c 取累计平均值

输入后果如下所示

a    b   avg_c
2014  A   3
2014  B   2
2014  C   2
2015  A   4
2015  D   3.5

参考答案

select 
  a, 
  b, 
  c, 
  avg(c) over(partition by a order by b) as avg_c
from t3;

问题三:按 a 分组按 b 字段排序,对 b 取累计排名比例

输入后果如下所示

a    b   ratio_c
2014  A   0.33
2014  B   0.67
2014  C   1.00
2015  A   0.50
2015  D   1.00

参考答案

select 
  a, 
  b, 
  c, 
  round(row_number() over(partition by a order by b) / (count(c) over(partition by a)),2) as ratio_c
from t3 
order by a,b;

问题四:按 a 分组按 b 字段排序,对 b 取累计求和比例

输入后果如下所示

a    b   ratio_c
2014  A   0.50
2014  B   0.67
2014  C   1.00
2015  A   0.57
2015  D   1.00

参考答案

select 
  a, 
  b, 
  c, 
  round(sum(c) over(partition by a order by b) / (sum(c) over(partition by a)),2) as ratio_c
from t3 
order by a,b;

四、窗口大小管制

表名 t4
\
表字段及内容

a    b   c
2014  A   3
2014  B   1
2014  C   2
2015  A   4
2015  D   3

问题一:按 a 分组按 b 字段排序,对 c 取前后各一行的和

输入后果如下所示

a    b   sum_c
2014  A   1
2014  B   5
2014  C   1
2015  A   3
2015  D   4

参考答案

select 
  a,
  b,
  lag(c,1,0) over(partition by a order by b)+lead(c,1,0) over(partition by a order by b) as sum_c
from t4;

问题二:按 a 分组按 b 字段排序,对 c 取平均值

问题形容:前一行与以后行的均值!

输入后果如下所示

a    b   avg_c
2014  A   3
2014  B   2
2014  C   1.5
2015  A   4
2015  D   3.5

参考答案
\
此处给出两种解法,其一:

select
    a,
    b,
    avg(c) over(partition by a order by b rows between 1 preceding and current row)
from
t4;

其二:

select
  a,
  b,
  case when lag_c is null then c
  else (c+lag_c)/2 end as avg_c
from
 (
 select
   a,
   b,
   c,
   lag(c,1) over(partition by a order by b) as lag_c
  from t4
 )temp;

五、产生间断数值

输入后果如下所示

1
2
3
4
5
...
100

参考答案
\
不借助其余任何表面,实现产生间断数值
\
此处给出两种解法,其一:

select
id_start+pos as id
from(
    select
    1 as id_start,
    1000000 as id_end
) m  lateral view posexplode(split(space(id_end-id_start), '')) t as pos, val

其二:

select
  row_number() over() as id
from  
  (select split(space(99), ' ') as x) t
lateral view
explode(x) ex;

那如何产生 1 至 1000000 间断数值?

参考答案

select
  row_number() over() as id
from  
  (select split(space(999999), ' ') as x) t
lateral view
explode(x) ex;

六、数据裁减与膨胀

表名 t6
\
表字段及内容

a
3
2
4

问题一:数据裁减

输入后果如下所示

a     b
3   3、2、1
2   2、1
4   4、3、2、1

参考答案

select  
  t.a,
  concat_ws('、',collect_set(cast(t.rn as string))) as b
from
(  
  select  
    t6.a,
    b.rn
  from t6
  left join
  ( 
   select
     row_number() over() as rn
   from  
   (select split(space(5), ' ') as x) t -- space(5)可依据 t6 表的最大值灵便调整
   lateral view
   explode(x) pe
  ) b
  on 1 = 1
  where t6.a >= b.rn
  order by t6.a, b.rn desc 
) t
group by  t.a;

问题二:数据裁减,排除偶数

输入后果如下所示

a     b
3   3、1
2   1
4   3、1

参考答案

select  
  t.a,
  concat_ws('、',collect_set(cast(t.rn as string))) as b
from
(  
  select  
    t6.a,
    b.rn
  from t6
  left join
  ( 
   select
     row_number() over() as rn
   from  
   (select split(space(5), ' ') as x) t
   lateral view
   explode(x) pe
  ) b
  on 1 = 1
  where t6.a >= b.rn and b.rn % 2 = 1
  order by t6.a, b.rn desc 
) t
group by  t.a;

问题三:如何解决字符串累计拼接

问题形容:将小于等于 a 字段的值聚合拼接起来

输入后果如下所示

a     b
3     2、3
2     2
4     2、3、4

参考答案

select  
  t.a,
  concat_ws('、',collect_set(cast(t.a1 as string))) as b
from
(   
  select  
    t6.a,
    b.a1
  from t6
  left join
  (   
   select  a as a1 
   from t6
  ) b
  on 1 = 1
  where t6.a >= b.a1
  order by t6.a, b.a1 
) t
group by  t.a;

问题四:如果 a 字段有反复,如何实现字符串累计拼接

输入后果如下所示

a     b
2     2
3     2、3
3     2、3、3
4     2、3、3、4

参考答案

select 
  a,
  b
from 
(
 select  
   t.a,
   t.rn,
   concat_ws('、',collect_list(cast(t.a1 as string))) as b
 from
  (   
    select  
     a.a,
     a.rn,
     b.a1
    from
    (
     select  
       a,
       row_number() over(order by a) as rn 
     from t6
    ) a
    left join
    (   
     select  a as a1,
     row_number() over(order by a) as rn  
     from t6
    ) b
    on 1 = 1
    where a.a >= b.a1 and a.rn >= b.rn 
    order by a.a, b.a1 
  ) t
  group by  t.a,t.rn
  order by t.a,t.rn
) tt; 

问题五:数据开展

问题形容:如何将字符串 ”1-5,16,11-13,9″ 扩大成 ”1,2,3,4,5,16,11,12,13,9″?留神程序不变。

参考答案

select  
  concat_ws(',',collect_list(cast(rn as string)))
from
(
  select  
   a.rn,
   b.num,
   b.pos
  from
   (
    select
     row_number() over() as rn
    from (select split(space(20), ' ') as x) t -- space(20)可灵便调整
    lateral view
    explode(x) pe
   ) a lateral view outer 
   posexplode(split('1-5,16,11-13,9', ',')) b as pos, num
   where a.rn between cast(split(num, '-')[0] as int) and cast(split(num, '-')[1] as int) or a.rn = num
   order by pos, rn 
) t;

七、合并与拆分

表名 t7
\
表字段及内容

a    b
2014  A
2014  B
2015  B
2015  D

问题一:合并

输入后果如下所示

2014  A、B
2015  B、D

参考答案:

select
  a,
  concat_ws('、', collect_set(t.b)) b
from t7
group by a;

问题二:拆分

问题形容:将分组合并的后果拆分进去

参考答案

select
  t.a,
  d
from
(
 select
  a,
  concat_ws('、', collect_set(t7.b)) b
 from t7
 group by a
)t
lateral view 
explode(split(t.b, '、')) table_tmp as d;

八、模仿循环操作

表名 t8
\
表字段及内容

a
1011
0101
问题一:如何将字符 ’1’ 的地位提取进去

输入后果如下所示:

1,3,4
2,4

参考答案

select 
    a,
    concat_ws(",",collect_list(cast(index as string))) as res
from (
    select 
        a,
        index+1 as index,
        chr
    from (
        select 
            a,
            concat_ws(",",substr(a,1,1),substr(a,2,1),substr(a,3,1),substr(a,-1)) str
        from t8
    ) tmp1
    lateral view posexplode(split(str,",")) t as index,chr
    where chr = "1"
) tmp2
group by a;

九、不应用 distinct 或 group by 去重

表名 t9
\
表字段及内容

a     b     c    d
2014  2016  2014   A
2014  2015  2015   B

问题一:不应用 distinct 或 group by 去重

输入后果如下所示

2014  A
2016  A
2014  B
2015  B

参考答案

select
  t2.year
  ,t2.num
from
 (
  select
    *
    ,row_number() over (partition by t1.year,t1.num) as rank_1
  from 
  (
    select 
      a as year,
      d as num
    from t9
    union all
    select 
      b as year,
      d as num
    from t9
    union all
    select 
      c as year,
      d as num
    from t9
   )t1
)t2
where rank_1=1
order by num;

十、容器 – 反转内容

表名 t10
\
表字段及内容

a
AB,CA,BAD
BD,EA

问题一:反转逗号分隔的数据:扭转程序,内容不变

输入后果如下所示

BAD,CA,AB
EA,BD

参考答案

select 
  a,
  concat_ws(",",collect_list(reverse(str)))
from 
(
  select 
    a,
    str
  from t10
  lateral view explode(split(reverse(a),",")) t as str
) tmp1
group by a;

问题二:反转逗号分隔的数据:扭转内容,程序不变

输入后果如下所示

BA,AC,DAB
DB,AE

参考答案

select 
  a,
  concat_ws(",",collect_list(reverse(str)))
from 
(
  select 
     a,
     str
  from t10
  lateral view explode(split(a,",")) t as str
) tmp1
group by a;

十一、多容器 – 成对提取数据

表名 t11
\
表字段及内容

a       b
A/B     1/3
B/C/D   4/5/2

问题一:成对提取数据,字段一一对应

输入后果如下所示

a       b
A       1
B       3
B       4
C       5
D       2

参考答案:

select 
  a_inx,
  b_inx
from 
(
  select 
     a,
     b,
     a_id,
     a_inx,
     b_id,
     b_inx
  from t11
  lateral view posexplode(split(a,'/')) t as a_id,a_inx
  lateral view posexplode(split(b,'/')) t as b_id,b_inx
) tmp
where a_id=b_id;

十二、多容器 – 转多行

表名 t12
\
表字段及内容

a        b      c
001     A/B     1/3/5
002     B/C/D   4/5

问题一:转多行

输入后果如下所示

a        d       e
001     type_b    A
001     type_b    B
001     type_c    1
001     type_c    3
001     type_c    5
002     type_b    B
002     type_b    C
002     type_b    D
002     type_c    4
002     type_c    5

参考答案:

select 
  a,
  d,
  e
from 
(
  select
    a,
    "type_b" as d,
    str as e
  from t12
  lateral view explode(split(b,"/")) t as str
  union all 
  select
    a,
    "type_c" as d,
    str as e
  from t12
  lateral view explode(split(c,"/")) t as str
) tmp
order by a,d;

十三、形象分组 – 断点排序

表名 t13
\
表字段及内容

a    b
2014  1
2015  1
2016  1
2017  0
2018  0
2019  -1
2020  -1
2021  -1
2022  1
2023  1

问题一:断点排序

输入后果如下所示

a    b    c 
2014  1    1
2015  1    2
2016  1    3
2017  0    1
2018  0    2
2019  -1   1
2020  -1   2
2021  -1   3
2022  1    1
2023  1    2

参考答案:

select  
  a,
  b,
  row_number() over( partition by b,repair_a order by a asc) as c-- 依照 b 列和 [b 的组首] 分组,排序
from 
(
  select  
    a,
    b,
    a-b_rn as repair_a-- 依据 b 列值呈现的秩序, 修复 a 列值为 b 首次呈现的 a 列值, 称为 b 的[组首]
  from 
  (
   select 
     a,
     b,
     row_number() over( partition by b order by  a  asc) as b_rn-- 按 b 列分组, 按 a 列排序, 失去 b 列各值呈现的秩序
   from t13 
  )tmp1
)tmp2-- 留神,如果不同的 b 列值,可能呈现同样的组首值,但组首值须要和 a 列值 一并参加分组,故并不影响排序。order by a asc; 

十四、业务逻辑的分类与形象 – 时效

日期表 d_date
\
表字段及内容

date_id      is_work
2017-04-13       1
2017-04-14       1
2017-04-15       0
2017-04-16       0
2017-04-17       1

工作日:周一至周五 09:30-18:30

客户申请表 t14
\
表字段及内容

a      b       c
1     申请   2017-04-14 18:03:00
1     通过   2017-04-17 09:43:00
2     申请   2017-04-13 17:02:00
2     通过   2017-04-15 09:42:00

问题一:计算上表中从申请到通过占用的工作时长

输入后果如下所示

a         d
1        0.67h
2       10.67h 

参考答案:

select 
    a,
    round(sum(diff)/3600,2) as d
from (
    select 
        a,
        apply_time,
        pass_time,
        dates,
        rn,
        ct,
        is_work,
        case when is_work=1 and rn=1 then unix_timestamp(concat(dates,'18:30:00'),'yyyy-MM-dd HH:mm:ss')-unix_timestamp(apply_time,'yyyy-MM-dd HH:mm:ss')
            when is_work=0 then 0
            when is_work=1 and rn=ct then unix_timestamp(pass_time,'yyyy-MM-dd HH:mm:ss')-unix_timestamp(concat(dates,'09:30:00'),'yyyy-MM-dd HH:mm:ss')
            when is_work=1 and rn!=ct then 9*3600
        end diff
    from (
        select 
            a,
            apply_time,
            pass_time,
            time_diff,
            day_diff,
            rn,
            ct,
            date_add(start,rn-1) dates
        from (
            select 
                a,
                apply_time,
                pass_time,
                time_diff,
                day_diff,
                strs,
                start,
                row_number() over(partition by a) as rn,
                count(*) over(partition by a) as ct
            from (
                select 
                    a,
                    apply_time,
                    pass_time,
                    time_diff,
                    day_diff,
                    substr(repeat(concat(substr(apply_time,1,10),','),day_diff+1),1,11*(day_diff+1)-1) strs
                from (
                    select 
                        a,
                        apply_time,
                        pass_time,
                        unix_timestamp(pass_time,'yyyy-MM-dd HH:mm:ss')-unix_timestamp(apply_time,'yyyy-MM-dd HH:mm:ss') time_diff,
                        datediff(substr(pass_time,1,10),substr(apply_time,1,10)) day_diff
                    from (
                        select 
                            a,
                            max(case when b='申请' then c end) apply_time,
                            max(case when b='通过' then c end) pass_time
                        from t14
                        group by a
                    ) tmp1
                ) tmp2
            ) tmp3 
            lateral view explode(split(strs,",")) t as start
        ) tmp4
    ) tmp5
    join d_date 
    on tmp5.dates = d_date.date_id
) tmp6
group by a;

十五、工夫序列 – 进度及残余

表名 t15
\
表字段及内容

date_id      is_work
2017-07-30      0
2017-07-31      1
2017-08-01      1
2017-08-02      1
2017-08-03      1
2017-08-04      1
2017-08-05      0
2017-08-06      0
2017-08-07      1

问题一:求每天的累计周工作日,残余周工作日

输入后果如下所示

date_id      week_to_work  week_left_work
2017-07-31      1             4
2017-08-01      2             3
2017-08-02      3             2
2017-08-03      4             1
2017-08-04      5             0
2017-08-05      5             0
2017-08-06      5             0

参考答案 :
\
此处给出两种解法,其一:

select 
 date_id
,case date_format(date_id,'u')
    when 1 then 1
    when 2 then 2 
    when 3 then 3 
    when 4 then 4
    when 5 then 5 
    when 6 then 5 
    when 7 then 5 
 end as week_to_work
,case date_format(date_id,'u')
    when 1 then 4
    when 2 then 3  
    when 3 then 2 
    when 4 then 1
    when 5 then 0 
    when 6 then 0 
    when 7 then 0 
 end as week_to_work
from t15

其二:

select
date_id,
week_to_work,
week_sum_work-week_to_work as week_left_work
from(
    select
    date_id,
    sum(is_work) over(partition by year,week order by date_id) as week_to_work,
    sum(is_work) over(partition by year,week) as week_sum_work
    from(
        select
        date_id,
        is_work,
        year(date_id) as year,
        weekofyear(date_id) as week
        from t15
    ) ta
) tb order by date_id;

十六、工夫序列 – 结构日期

问题一:间接应用 SQL 实现一张日期维度表,蕴含以下字段:

date                    string                  日期
d_week                  string                  年内第几周
weeks                   int                     周几
w_start                 string                  周开始日
w_end                   string                  周完结日
d_month                int                     第几月
m_start                string                  月开始日
m_end                  string                  月完结日
d_quarter            int                    第几季
q_start                string                  季开始日
q_end                  string                  季完结日
d_year               int                    年份
y_start                string                  年开始日
y_end                  string                  年完结日

参考答案

drop table if exists dim_date;
create table if not exists dim_date(
    `date` string comment '日期',
    d_week string comment '年内第几周',
    weeks string comment '周几',
    w_start string comment '周开始日',
    w_end string comment '周完结日',
    d_month string comment '第几月',
    m_start string comment '月开始日',
    m_end string comment '月完结日',
    d_quarter int comment '第几季',
    q_start string comment '季开始日',
    q_end string comment '季完结日',
    d_year int comment '年份',
    y_start string comment '年开始日',
    y_end string comment '年完结日'
);
-- 天然月: 指每月的 1 号到那个月的月底,它是依照阳历来计算的。就是从每月 1 号到月底,不论这个月有 30 天,31 天,29 天或者 28 天,都算是一个天然月。insert overwrite table dim_date
select `date`
     , d_week -- 年内第几周
     , case weekid
           when 0 then '周日'
           when 1 then '周一'
           when 2 then '周二'
           when 3 then '周三'
           when 4 then '周四'
           when 5 then '周五'
           when 6 then '周六'
    end  as weeks -- 周
     , date_add(next_day(`date`,'MO'),-7) as w_start -- 周一
     , date_add(next_day(`date`,'MO'),-1) as w_end   -- 周日_end
     -- 月份日期
     , concat('第', monthid, '月')  as d_month
     , m_start
     , m_end

     -- 节令
     , quarterid as d_quart
     , concat(d_year, '-', substr(concat('0', (quarterid - 1) * 3 + 1), -2), '-01') as q_start -- 季开始日
     , date_sub(concat(d_year, '-', substr(concat('0', (quarterid) * 3 + 1), -2), '-01'), 1) as q_end   -- 季完结日
     -- 年
     , d_year
     , y_start
     , y_end


from (
         select `date`
              , pmod(datediff(`date`, '2012-01-01'), 7)                  as weekid    -- 获取周几
              , cast(substr(`date`, 6, 2) as int)                        as monthid   -- 获取月份
              , case
                    when cast(substr(`date`, 6, 2) as int) <= 3 then 1
                    when cast(substr(`date`, 6, 2) as int) <= 6 then 2
                    when cast(substr(`date`, 6, 2) as int) <= 9 then 3
                    when cast(substr(`date`, 6, 2) as int) <= 12 then 4
             end                                                       as quarterid -- 获取节令 能够间接应用 quarter(`date`)
              , substr(`date`, 1, 4)                                     as d_year    -- 获取年份
              , trunc(`date`, 'YYYY')                                    as y_start   -- 年开始日
              , date_sub(trunc(add_months(`date`, 12), 'YYYY'), 1) as y_end     -- 年完结日
              , date_sub(`date`, dayofmonth(`date`) - 1)                 as m_start   -- 当月第一天
              , last_day(date_sub(`date`, dayofmonth(`date`) - 1))          m_end     -- 当月最初一天
              , weekofyear(`date`)                                       as d_week    -- 年内第几周
         from (
                    -- '2021-04-01' 是开始日期, '2022-03-31' 是截止日期
                  select date_add('2021-04-01', t0.pos) as `date`
                  from (
                           select posexplode(
                                          split(
                                                  repeat('o', datediff(from_unixtime(unix_timestamp('2022-03-31', 'yyyy-mm-dd'),
                                                                        'yyyy-mm-dd'),
                                                          '2021-04-01')), 'o'
                                              )
                                      )
                       ) t0
              ) t1
     ) t2;

十七、工夫序列 – 结构累积日期

表名 t17
\
表字段及内容

date_id
2017-08-01
2017-08-02
2017-08-03

问题一:每一日期,都扩大成月初至当天

输入后果如下所示

date_id    date_to_day
2017-08-01     2017-08-01
2017-08-02     2017-08-01
2017-08-02     2017-08-02
2017-08-03     2017-08-01
2017-08-03     2017-08-02
2017-08-03     2017-08-03

这种累积相干的表,常做桥接表。

参考答案:

select
  date_id,
  date_add(date_start_id,pos) as date_to_day
from
(
  select
    date_id,
    date_sub(date_id,dayofmonth(date_id)-1) as date_start_id
  from t17
) m  lateral view 
posexplode(split(space(datediff(from_unixtime(unix_timestamp(date_id,'yyyy-MM-dd')),from_unixtime(unix_timestamp(date_start_id,'yyyy-MM-dd')))), '')) t as pos, val;

十八、工夫序列 – 结构间断日期

表名 t18
\
表字段及内容

a             b         c
101        2018-01-01     10
101        2018-01-03     20
101        2018-01-06     40
102        2018-01-02     20
102        2018-01-04     30
102        2018-01-07     60

问题一:结构间断日期

问题形容:将表中数据的 b 字段裁减至范畴[2018-01-01, 2018-01-07],并累积对 c 求和。
\
b 字段的值是较稠密的。

输入后果如下所示

a             b          c      d
101        2018-01-01     10     10
101        2018-01-02      0     10
101        2018-01-03     20     30
101        2018-01-04      0     30
101        2018-01-05      0     30
101        2018-01-06     40     70
101        2018-01-07      0     70
102        2018-01-01      0      0
102        2018-01-02     20     20
102        2018-01-03      0     20
102        2018-01-04     30     50
102        2018-01-05      0     50
102        2018-01-06      0     50
102        2018-01-07     60    110

参考答案:

select
  a,
  b,
  c,
  sum(c) over(partition by a order by b) as d
from
(
  select
  t1.a,
  t1.b,
  case
    when t18.b is not null then t18.c
    else 0
  end as c
  from
  (
    select
    a,
    date_add(s,pos) as b
    from
    (
      select
        a, 
       '2018-01-01' as s, 
       '2018-01-07' as r
      from (select a from t18 group by a) ta
    ) m  lateral view 
      posexplode(split(space(datediff(from_unixtime(unix_timestamp(r,'yyyy-MM-dd')),from_unixtime(unix_timestamp(s,'yyyy-MM-dd')))), '')) t as pos, val
  ) t1
    left join t18
    on  t1.a = t18.a and t1.b = t18.b
) ts;

十九、工夫序列 – 取多个字段最新的值

表名 t19
\
表字段及内容

date_id   a   b    c
2014     AB  12    bc
2015         23    
2016               d
2017     BC 

问题一:如何一并取出最新日期

输入后果如下所示

date_a   a    date_b    b    date_c   c
2017    BC    2015     23    2016    d

参考答案 :
\
此处给出三种解法,其一:

SELECT  max(CASE WHEN rn_a = 1 THEN date_id else 0 END) AS date_a
        ,max(CASE WHEN rn_a = 1 THEN a else null END) AS a
        ,max(CASE WHEN rn_b = 1 THEN date_id else 0 END) AS date_b
        ,max(CASE WHEN rn_b = 1 THEN b else NULL  END) AS b
        ,max(CASE WHEN rn_c = 1 THEN date_id  else 0 END) AS date_c
        ,max(CASE WHEN rn_c = 1 THEN c else null END) AS c
FROM    (
            SELECT  date_id
                    ,a
                    ,b
                    ,c
                    -- 对每列上不为 null 的值  的 日期 进行排序
                    ,row_number()OVER( PARTITION BY 1 ORDER BY CASE WHEN a IS NULL THEN 0 ELSE date_id END DESC) AS rn_a
                    ,row_number()OVER(PARTITION BY 1 ORDER BY CASE WHEN b IS NULL THEN 0 ELSE date_id END DESC) AS rn_b
                    ,row_number()OVER(PARTITION BY 1 ORDER BY CASE WHEN c IS NULL THEN 0 ELSE date_id END DESC) AS rn_c
            FROM    t19
        ) t
WHERE   t.rn_a = 1
OR      t.rn_b = 1
OR      t.rn_c = 1;

其二:

SELECT  
   a.date_id
  ,a.a
  ,b.date_id
  ,b.b
  ,c.date_id
  ,c.c
FROM
(
   SELECT  
      t.date_id,
      t.a
   FROM  
   (
     SELECT  
       t.date_id
       ,t.a
       ,t.b
       ,t.c
     FROM t19 t INNER JOIN    t19 t1 ON t.date_id = t1.date_id AND t.a IS NOT NULL
   ) t
   ORDER BY t.date_id DESC
   LIMIT 1
) a
LEFT JOIN 
(
  SELECT  
    t.date_id
    ,t.b
  FROM    
  (
    SELECT  
      t.date_id
      ,t.b
    FROM t19 t INNER JOIN t19 t1 ON t.date_id = t1.date_id AND t.b IS NOT NULL
  ) t
  ORDER BY t.date_id DESC
  LIMIT 1
) b ON 1 = 1 
LEFT JOIN
(
  SELECT  
    t.date_id
    ,t.c
  FROM    
  (
    SELECT  
      t.date_id
      ,t.c
    FROM t19 t INNER JOIN t19 t1 ON t.date_id = t1.date_id AND t.c IS NOT NULL
  ) t
  ORDER BY t.date_id DESC
  LIMIT   1
) c
ON 1 = 1;

其三:

select 
  * 
from  
(select t1.date_id as date_a,t1.a from (select t1.date_id,t1.a  from t19 t1 where t1.a is not null) t1
  inner join (select max(t1.date_id) as date_id   from t19 t1 where t1.a is not null) t2
  on t1.date_id=t2.date_id
) t1
cross join
(select t1.date_b,t1.b from (select t1.date_id as date_b,t1.b  from t19 t1 where t1.b is not null) t1
  inner join (select max(t1.date_id) as date_id   from t19 t1 where t1.b is not null)t2
  on t1.date_b=t2.date_id
) t2
cross join 
(select t1.date_c,t1.c from (select t1.date_id as date_c,t1.c  from t19 t1 where t1.c is not null) t1
  inner join (select max(t1.date_id) as date_id   from t19 t1 where t1.c is not null)t2
  on t1.date_c=t2.date_id
) t3;

二十、工夫序列 – 补全数据

表名 t20
\
表字段及内容

date_id   a   b    c
2014     AB  12    bc
2015         23    
2016               d
2017     BC 

问题一:如何应用最新数据补全表格

输入后果如下所示

date_id   a   b    c
2014     AB  12    bc
2015     AB  23    bc
2016     AB  23    d
2017     BC  23    d

参考答案:

select 
  date_id, 
  first_value(a) over(partition by aa order by date_id) as a,
  first_value(b) over(partition by bb order by date_id) as b,
  first_value(c) over(partition by cc order by date_id) as c
from
(
  select 
    date_id,
    a,
    b,
    c,
    count(a) over(order by date_id) as aa,
    count(b) over(order by date_id) as bb,
    count(c) over(order by date_id) as cc
  from t20
)tmp1;

二十一、工夫序列 – 取最新实现状态的前一个状态

表名 t21
\
表字段及内容

date_id   a    b
2014     1    A
2015     1    B
2016     1    A
2017     1    B
2013     2    A
2014     2    B
2015     2    A
2014     3    A
2015     3    A
2016     3    B
2017     3    A

上表中 B 为实现状态

问题一:取最新实现状态的前一个状态

输入后果如下所示

date_id  a    b
2016     1    A
2013     2    A
2015     3    A

参考答案 :
\
此处给出两种解法,其一:

select
    t21.date_id,
    t21.a,
    t21.b
from
    (
        select
            max(date_id) date_id,
            a
        from
            t21
        where
            b = 'B'
        group by
            a
    ) t1
    inner join t21 on t1.date_id -1 = t21.date_id
and t1.a = t21.a;

其二:

select
  next_date_id as date_id
  ,a
  ,next_b as b
from(
  select
    *,min(nk) over(partition by a,b) as minb
  from(
    select
      *,row_number() over(partition by a order by date_id desc) nk
      ,lead(date_id) over(partition by a order by date_id desc) next_date_id
      ,lead(b) over(partition by a order by date_id desc) next_b
    from(select * from t21) t
  ) t
) t
where minb = nk and b = 'B';

问题二:如何将实现状态的过程合并

输入后果如下所示:

a   b_merge
1   A、B、A、B
2   A、B
3   A、A、B

参考答案

select
  a
  ,collect_list(b) as b
from(
  select
    *
    ,min(if(b = 'B',nk,null)) over(partition by a) as minb
  from(
    select
      *,row_number() over(partition by a order by date_id desc) nk
    from(select * from t21) t
  ) t
) t
where nk >= minb
group by a;

二十二、非等值连贯 – 范畴匹配

表 f 是事实表,表 d 是匹配表,在 hive 中如何将匹配表中的值关联到事实表中?

表 d 相当于拉链过的变动维,但日期范畴可能是不全的。

表 f

date_id  p_id
 2017    C
 2018    B
 2019    A
 2013    C

表 d

d_start    d_end    p_id   p_value
 2016     2018     A       1
 2016     2018     B       2
 2008     2009     C       4
 2010     2015     C       3

问题一:范畴匹配

输入后果如下所示

date_id  p_id   p_value
 2017    C      null
 2018    B      2
 2019    A      null
 2013    C      3

*参考答案 *:
\
此处给出两种解法,其一:

select 
  f.date_id,
  f.p_id,
  A.p_value
from f 
left join 
(
  select 
    date_id,
    p_id,
    p_value
  from 
  (
    select 
      f.date_id,
      f.p_id,
      d.p_value
    from f 
    left join d on f.p_id = d.p_id
    where f.date_id >= d.d_start and f.date_id <= d.d_end
  )A
)A
ON f.date_id = A.date_id;

其二:

select 
    date_id,
    p_id,
    flag as p_value
from (
    select 
        f.date_id,
        f.p_id,
        d.d_start,
        d.d_end,
        d.p_value,
        if(f.date_id between d.d_start and d.d_end,d.p_value,null) flag,
        max(d.d_end) over(partition by date_id) max_end
    from f
    left join d
    on f.p_id = d.p_id
) tmp
where d_end = max_end;

二十三、非等值连贯 – 最近匹配

表 t23_1 和表 t23_2 通过 a 和 b 关联时,有相等的取相等的值匹配,不相等时每一个 a 的值在 b 中找差值最小的来匹配。

t23_1 和 t23_2 为两个班的成绩单,t23_1 班的每个学生问题在 t23_2 班中找出问题最靠近的问题。

表 t23_1:a 中无反复值

a
1
2
4
5
8
10

表 t23_2:b 中无反复值

b
2
3
7
11
13

问题一:单向最近匹配

输入后果如下所示
\
留神:b 的值可能会被抛弃

a    b
1    2
2    2
4    3
5    3
5    7
8    7
10   11

参考答案

select 
  * 
from
(
  select 
    ttt1.a,
    ttt1.b 
  from
  (
    select 
      tt1.a,
      t23_2.b,
      dense_rank() over(partition by tt1.a order by abs(tt1.a-t23_2.b)) as dr 
    from 
    (
      select 
        t23_1.a 
      from t23_1 
      left join t23_2 on t23_1.a=t23_2.b 
      where t23_2.b is null
    ) tt1 
    cross join t23_2
  ) ttt1 
  where ttt1.dr=1 
  union all
  select 
    t23_1.a,
    t23_2.b 
  from t23_1 
  inner join t23_2 on t23_1.a=t23_2.b
) result_t 
order by result_t.a;

二十四、N 指标 – 累计去重

假如表 A 为事件流水表,客户当天有一条记录则视为当天沉闷。

表 A

   time_id          user_id
2018-01-01 10:00:00    001
2018-01-01 11:03:00    002
2018-01-01 13:18:00    001
2018-01-02 08:34:00    004
2018-01-02 10:08:00    002
2018-01-02 10:40:00    003
2018-01-02 14:21:00    002
2018-01-02 15:39:00    004
2018-01-03 08:34:00    005
2018-01-03 10:08:00    003
2018-01-03 10:40:00    001
2018-01-03 14:21:00    005

假如客户沉闷十分,一天产生的事件记录均匀达千条。

问题一:累计去重

输入后果如下所示

日期       当日沉闷人数     月累计沉闷人数_截至当日
date_id   user_cnt_act    user_cnt_act_month
2018-01-01      2                2
2018-01-02      3                4
2018-01-03      3                5

参考答案

SELECT  tt1.date_id
       ,tt2.user_cnt_act
       ,tt1.user_cnt_act_month
FROM
(   -- ④ 依照 t.date_id 分组求出 user_cnt_act_month,失去 tt1
    SELECT  t.date_id
           ,COUNT(user_id) AS user_cnt_act_month
    FROM
    (   -- ③ 表 a 和表 b 进行笛卡尔积,依照 a.date_id,b.user_id 分组,保障截止到当日的用户惟一,得出表 t。SELECT  a.date_id
               ,b.user_id
        FROM
        (   -- ① 依照日期分组,取出 date_id 字段当主表的维度字段 得出表 a
            SELECT  from_unixtime(unix_timestamp(time_id),'yyyy-MM-dd') AS date_id
            FROM test.temp_tanhaidi_20211213_1
            GROUP BY  from_unixtime(unix_timestamp(time_id),'yyyy-MM-dd')
        ) a
        INNER JOIN
        (   -- ② 依照 date_id、user_id 分组,保障每天每个用户只有一条记录,得出表 b
            SELECT  from_unixtime(unix_timestamp(time_id),'yyyy-MM-dd') AS date_id
                   ,user_id
            FROM test.temp_tanhaidi_20211213_1
            GROUP BY  from_unixtime(unix_timestamp(time_id),'yyyy-MM-dd')
                     ,user_id
        ) b
        ON 1 = 1
        WHERE a.date_id >= b.date_id
        GROUP BY  a.date_id
                 ,b.user_id
    ) t
    GROUP BY  t.date_id
) tt1
LEFT JOIN
(   -- ⑥ 依照 date_id 分组求出 user_cnt_act,失去 tt2
    SELECT  date_id
           ,COUNT(user_id) AS user_cnt_act
    FROM
    (   -- ⑤ 依照日期分组,取出 date_id 字段当主表的维度字段 得出表 a
        SELECT  from_unixtime(unix_timestamp(time_id),'yyyy-MM-dd') AS date_id
               ,user_id
        FROM test.temp_tanhaidi_20211213_1
        GROUP BY  from_unixtime(unix_timestamp(time_id),'yyyy-MM-dd')
                 ,user_id
    ) a
    GROUP BY date_id
) tt2
ON tt2.date_id = tt1.date_id

参考

最强最全面的大数据 SQL 经典面试题残缺 PDF 版

退出移动版