共计 2382 个字符,预计需要花费 6 分钟才能阅读完成。
数据筹备 | |
---|---|
id | login_date |
01 | 2021-02-28 |
01 | 2021-03-01 |
01 | 2021-03-02 |
01 | 2021-03-04 |
01 | 2021-03-05 |
01 | 2021-03-06 |
01 | 2021-03-08 |
02 | 2021-03-01 |
02 | 2021-03-02 |
02 | 2021-03-03 |
02 | 2021-03-06 |
03 | 2021-03-06 |
计划一
1. 先把数据依照用户 id 分组,依据登录日期排序
SQL:
SELECT
id,
login_date,
row_number() over(partition by id order by login_date asc) as rn
FROM data;
后果: | ||
---|---|---|
id | login_date | rn |
01 | 2021-02-28 | 1 |
01 | 2021-03-01 | 2 |
01 | 2021-03-02 | 3 |
01 | 2021-03-04 | 4 |
01 | 2021-03-05 | 5 |
01 | 2021-03-06 | 6 |
01 | 2021-03-08 | 7 |
02 | 2021-03-01 | 1 |
02 | 2021-03-02 | 2 |
02 | 2021-03-03 | 3 |
02 | 2021-03-06 | 4 |
03 | 2021-03-06 | 1 |
2. 用登录日期与 rn 求 date_sub,失去的差值日期如果是相等大数据培训的,则阐明这两天必定是间断的
SQL:
SELECT
t1.id,
t1.login_date,
date_sub(t1.login_date, rn) as diff_date
FROM
(
SELECT
id,
login_date,
row_number() over(partition by id order by login_date asc) as rn
FROM data
) t1;
后果: | ||
---|---|---|
id | login_date | diff_date |
01 | 2021-02-28 | 2021-02-27 |
01 | 2021-03-01 | 2021-02-27 |
01 | 2021-03-02 | 2021-02-27 |
01 | 2021-03-04 | 2021-02-28 |
01 | 2021-03-05 | 2021-02-28 |
01 | 2021-03-06 | 2021-02-28 |
01 | 2021-03-08 | 2021-03-01 |
02 | 2021-03-01 | 2021-02-28 |
02 | 2021-03-02 | 2021-02-28 |
02 | 2021-03-03 | 2021-02-28 |
02 | 2021-03-06 | 2021-03-02 |
03 | 2021-03-06 | 2021-03-05 |
3. 依据 id 和日期差 date_diff 分组,登录次数即为分组后的 count(1)
SQL:
SELECT
t2.id,
count(1) as login_times,
min(t2.login_date) as start_date,
max(t2.login_date) as end_date
FROM
(
SELECT
t1.id,
t1.login_date,
date_sub(t1.login_date,rn) as diff_date
FROM
(
SELECT
id,
login_date,
row_number() over(partition by id order by login_date asc) as rn
FROM data
) t1
) t2
group by t2.id, t2.diff_date
having login_times >= 3;
后果: | |||
---|---|---|---|
id | login_times | start_date | end_date |
01 | 3 | 2021-02-28 | 2021-03-02 |
01 | 3 | 2021-03-04 | 2021-03-06 |
02 | 3 | 2021-03-01 | 2021-03-03 |
计划二
计划二利用 lag 和 lead 函数进行解决,思路相似。
SQL:
SELECT
id,
lag_login_date,
login_date,lead_login_date
FROM
(SELECT
id,
login_date,
lag(login_date,1,login_date) over(partition by id order by login_date) as lag_login_date,
lead(login_date,1,login_date) over(partition by id order by login_date) as lead_login_date
FROM data
) t1
where datediff(login_date,lag_login_date) =1 and datediff(lead_login_date,login_date) =1; 后果: | |||
---|---|---|---|
id | lag_login_date | login_date | lead_login_date |
01 | 2018-02-28 | 2018-03-01 | 2018-03-02 |
01 | 2018-03-04 | 2018-03-05 | 2018-03-06 |
02 | 2018-03-01 | 2018-03-02 | 2018-03-03 |
行转列和列转行
以 ” 间断登录 ” 中的数据为例:
select id,
concat_ws(',',collect_list(login_date)) cw
from data
group by id; 后果: | |
---|---|
id | cw |
01 | 2018-02-28,2018-03-01,2018-03-02,2018-03-04,2018-03-05,2018-03-06,2018-03-08 |
02 | 2018-03-01,2018-03-02,2018-03-03,2018-03-06 |
03 | 2018-03-06 |
以下面 SQL 生成的数据为基准,执行下列 SQL:
select id, login_date
from t
lateral view explode(split(cw,’,’)) b AS login_date; 后果: | |
---|---|
id | login_date |
01 | 2018-02-28 |
01 | 2018-03-01 |
01 | 2018-03-02 |
01 | 2018-03-04 |
01 | 2018-03-05 |
01 | 2018-03-06 |
01 | 2018-03-08 |
02 | 2018-03-01 |
02 | 2018-03-02 |
02 | 2018-03-03 |
02 | 2018-03-06 |
03 | 2018-03-06 |