关于大数据:Hive面试题之连续登录行转列和列转行分析

7次阅读

共计 2382 个字符,预计需要花费 6 分钟才能阅读完成。

数据筹备
id login_date
01 2021-02-28
01 2021-03-01
01 2021-03-02
01 2021-03-04
01 2021-03-05
01 2021-03-06
01 2021-03-08
02 2021-03-01
02 2021-03-02
02 2021-03-03
02 2021-03-06
03 2021-03-06

计划一
1. 先把数据依照用户 id 分组,依据登录日期排序
SQL:
SELECT
id,
login_date,
row_number() over(partition by id order by login_date asc) as rn
FROM data;

后果:
id login_date rn
01 2021-02-28 1
01 2021-03-01 2
01 2021-03-02 3
01 2021-03-04 4
01 2021-03-05 5
01 2021-03-06 6
01 2021-03-08 7
02 2021-03-01 1
02 2021-03-02 2
02 2021-03-03 3
02 2021-03-06 4
03 2021-03-06 1

2. 用登录日期与 rn 求 date_sub,失去的差值日期如果是相等大数据培训的,则阐明这两天必定是间断的
SQL:
SELECT

 t1.id,
 t1.login_date,
 date_sub(t1.login_date, rn) as diff_date

FROM

(
   SELECT
       id,
       login_date,
       row_number() over(partition by id order by login_date asc) as rn 
   FROM data
) t1;
后果:
id login_date diff_date
01 2021-02-28 2021-02-27
01 2021-03-01 2021-02-27
01 2021-03-02 2021-02-27
01 2021-03-04 2021-02-28
01 2021-03-05 2021-02-28
01 2021-03-06 2021-02-28
01 2021-03-08 2021-03-01
02 2021-03-01 2021-02-28
02 2021-03-02 2021-02-28
02 2021-03-03 2021-02-28
02 2021-03-06 2021-03-02
03 2021-03-06 2021-03-05

3. 依据 id 和日期差 date_diff 分组,登录次数即为分组后的 count(1)
SQL:
SELECT
t2.id,
count(1) as login_times,
min(t2.login_date) as start_date,
max(t2.login_date) as end_date
FROM
(

SELECT
 t1.id,
 t1.login_date,
 date_sub(t1.login_date,rn) as diff_date
FROM
(
    SELECT
     id,
     login_date,
     row_number() over(partition by id order by login_date asc) as rn 
    FROM data
) t1

) t2
group by t2.id, t2.diff_date
having login_times >= 3;

后果:
id login_times start_date end_date
01 3 2021-02-28 2021-03-02
01 3 2021-03-04 2021-03-06
02 3 2021-03-01 2021-03-03

计划二
计划二利用 lag 和 lead 函数进行解决,思路相似。
SQL:
SELECT
id,
lag_login_date,
login_date,lead_login_date
FROM

  (SELECT 
     id,
     login_date,
     lag(login_date,1,login_date) over(partition by id order by login_date) as lag_login_date,
     lead(login_date,1,login_date) over(partition by id order by login_date) as lead_login_date
  FROM data
  ) t1
where datediff(login_date,lag_login_date) =1 and datediff(lead_login_date,login_date) =1; 后果:
id lag_login_date login_date lead_login_date
01 2018-02-28 2018-03-01 2018-03-02
01 2018-03-04 2018-03-05 2018-03-06
02 2018-03-01 2018-03-02 2018-03-03

行转列和列转行
以 ” 间断登录 ” 中的数据为例:
select id,

   concat_ws(',',collect_list(login_date)) cw

from data

group by id; 后果:
id cw
01 2018-02-28,2018-03-01,2018-03-02,2018-03-04,2018-03-05,2018-03-06,2018-03-08
02 2018-03-01,2018-03-02,2018-03-03,2018-03-06
03 2018-03-06

以下面 SQL 生成的数据为基准,执行下列 SQL:
select id, login_date
from t

lateral view explode(split(cw,’,’)) b AS login_date; 后果:
id login_date
01 2018-02-28
01 2018-03-01
01 2018-03-02
01 2018-03-04
01 2018-03-05
01 2018-03-06
01 2018-03-08
02 2018-03-01
02 2018-03-02
02 2018-03-03
02 2018-03-06
03 2018-03-06
正文完
 0