一文解决SQL间断问题!
计算间断沉闷
间断沉闷或者登录的计算是数据分析业务场景中十分经典和常见的问题,而且波及到的解题思路比拟奇妙,所以也是面试口试中十分喜爱考查的知识点~
记录用户沉闷工夫信息的表logins
如下:
+--------+---------------+
| userid | activity_date |
+--------+---------------+
| 10001 | 2021-01-01 |
| 10001 | 2021-01-01 |
| 10001 | 2021-01-02 |
| 10001 | 2021-01-02 |
| 10001 | 2021-01-03 |
| 10002 | 2021-01-02 |
| 10002 | 2021-01-02 |
| 10002 | 2021-01-03 |
| 10002 | 2021-01-04 |
| 10002 | 2021-01-04 |
+--------+---------------+
需要:计算间断沉闷2天及以上的用户。
步骤1:用户沉闷日期去重
因为一个用户同一天可能沉闷屡次,咱们只须要晓得用户在某天是否沉闷即可,所以须要对用户id+沉闷日期去重。
select
DISTINCT activity_date active_day,
userid
from logins;
-- 或者
select
userid,activity_date active_day
from logins
group by 1,2;
+--------+---------------+
| userid | activity_date |
+--------+---------------+
| 10001 | 2021-01-01 |
| 10001 | 2021-01-02 |
| 10001 | 2021-01-03 |
| 10002 | 2021-01-02 |
| 10002 | 2021-01-03 |
| 10002 | 2021-01-04 |
+--------+---------------+
步骤2:窗口函数按日期排序
有了第1步去重后的后果,咱们能够用row_number()
函数对其进行开窗(其余的也行),以userid
分组(partition by userid
),日期升序排序(order by active_day
),失去每个用户依照日期升序的排名。
select *,
row_number() over(PARTITION by userid
order by active_day) as rn
from
(select userid,activity_date as active_day
from logins
group by 1,2) tmp;
+--------+------------+----+
| userid | active_day | rn |
+--------+------------+----+
| 10001 | 2021-01-01 | 1 |
| 10001 | 2021-01-02 | 2 |
| 10001 | 2021-01-03 | 3 |
| 10002 | 2021-01-02 | 1 |
| 10002 | 2021-01-03 | 2 |
| 10002 | 2021-01-04 | 3 |
+--------+------------+----+
步骤3:序减定组
【序减定组】是啥意思?因为咱们用row_number
排序之后的名次是间断的,那么如果凑巧日期也是间断的,日期和排序差值不就是一个固定常数了吗?这样产生的每个不同的常数就代表不同的间断日期的组合!
咱们应用date_sub
函数将以后日期active_day
和以后的排序值rn
相减,失去一个新的列,命名为day_group
with t as
(
select *,
row_number() over(PARTITION by userid
order by active_day) as rn
from
(select
userid,activity_date as active_day
from logins
group by 1,2) tmp1)
select
t.*, date_sub(active_day, interval rn DAY) as day_group
from t
+--------+------------+----+------------+
| userid | active_day | rn | day_group |
+--------+------------+----+------------+
| 10001 | 2021-01-01 | 1 | 2020-12-31 |
| 10001 | 2021-01-02 | 2 | 2020-12-31 |
| 10001 | 2021-01-03 | 3 | 2020-12-31 |
| 10002 | 2021-01-02 | 1 | 2021-01-01 |
| 10002 | 2021-01-03 | 2 | 2021-01-01 |
| 10002 | 2021-01-04 | 3 | 2021-01-01 |
+--------+------------+----+------------+
步骤4:依据分组列计算间断天数
select
userid, day_group,
count(*) as continous_days
from
(select
t.*,
date_sub(active_day, interval rn DAY)
as day_group
from t) tmp
group by day_group, userid
having count(*) >= 2
+--------+------------+----------------+
| userid | day_group | continous_days |
+--------+------------+----------------+
| 10001 | 2020-12-31 | 3 |
| 10002 | 2021-01-01 | 3 |
+--------+------------+----------------+
到这里后果就进去了,userid为10001和10002的用户都间断沉闷了3天,也就是合乎咱们条件的用户。
本文由mdnice多平台公布
发表回复