<article class=“article fmt article-content”><p></p><p>在本案例中,咱们将应用 Databend Cloud 对来自天池实验室的淘宝用户购物行为数据集进行剖析,一起发现乏味的购物行为。</p><p>该数据集为 CSV 格局,蕴含了 2017 年 11 月 25 日至 2017 年 12 月 3 日之间,有行为的约一百万随机用户的所有行为(包含点击、购买、加购、喜爱)。数据集的每一行示意一条用户行为,由以下 5 列组成,并以逗号分隔:</p><table><thead><tr><th>列名称</th><th>阐明</th></tr></thead><tbody><tr><td>用户 ID</td><td>整数类型,序列化后的用户 ID</td></tr><tr><td>商品 ID</td><td>整数类型,序列化后的商品 ID</td></tr><tr><td>商品类目 ID</td><td>整数类型,序列化后的商品所属类目 ID</td></tr><tr><td>行为类型</td><td>字符串,枚举类型,包含:‘pv’:商品详情页 pv,等价于点击; ‘buy’:商品购买; ‘cart’:将商品退出购物车; ‘fav’:珍藏商品</td></tr><tr><td>工夫戳</td><td>行为产生的工夫戳</td></tr></tbody></table><h2>筹备工作</h2><h3>下载数据集</h3><ol><li>下载淘宝用户购物行为数据集到本地,而后应用以下命令解压:</li></ol><pre><code>unzip UserBehavior.csv.zip</code></pre><ol start=“2”><li>将解压后的数据集文件 (UserBehavior.csv) 压缩为 gzip 格局:</li></ol><pre><code>gzip UserBehavior.csv</code></pre><h3>创立内部 Stage</h3><ol><li>登入 Databend Cloud,并新建一个工作区。</li><li>在工作区中,执行以下 SQL 语句在阿里云上创立一个名为"mycsv"的内部 Stage:</li></ol><pre><code>CREATE STAGE mycsv URL = ‘s3://<YOUR_BUCKET_NAME>‘CONNECTION = ( ACCESS_KEY_ID = ‘<YOUR_ACCESS_KEY_ID>’, SECRET_ACCESS_KEY = ‘<YOUR_SECRET_ACCESS_KEY>’, ENDPOINT_URL = ‘<YOUR_ENDPOINT_URL>’, ENABLE_VIRTUAL_HOST_STYLE = TRUE)FILE_FORMAT = ( TYPE = CSV COMPRESSION = AUTO);</code></pre><ol start=“3”><li>执行以下 SQL 语句验证 Databend Cloud 是否可拜访到该内部 Stage:</li></ol><pre><code>LIST @mycsv;</code></pre><h3>上传数据集到内部 Stage</h3><p>应用 BendSQL将压缩后的数据集文件 (UserBehavior.csv.gz) 上传到内部 Stage。获取计算集群的连贯信息,请参考连贯到计算集群。</p><pre><code>(base) eric@Erics-iMac ~ % bendsql –host tenantID–YOUR_WAREHOUSE.gw.aliyun-cn-beijing.default.databend.cn \ –user=cloudapp \ –password=<YOUR_PASSWORD> \ –database=“default” \ –port=443 –tlsWelcome to BendSQL 0.9.3-db6b232(2023-10-26T12:36:55.578667000Z).Connecting to tenantID–YOUR_WAREHOUSE.gw.aliyun-cn-beijing.default.databend.cn:443 as user cloudapp.Connected to DatabendQuery v1.2.183-nightly-1ed9a826ed(rust-1.72.0-nightly-2023-10-28T22:10:15.618365223Z)cloudapp@tenantID–YOUR_WAREHOUSE.gw.aliyun-cn-beijing.default.databend.cn:443/default> PUT fs:///Users/eric/Documents/UserBehavior.csv.gz @mycsvPUT fs:///Users/eric/Documents/UserBehavior.csv.gz @mycsv┌─────────────────────────────────────────────────────────────────┐│ file │ status │ size ││ String │ String │ UInt64 │├───────────────────────────────────────────┼─────────┼───────────┤│ /Users/eric/Documents/UserBehavior.csv.gz │ SUCCESS │ 949805035 │└─────────────────────────────────────────────────────────────────┘1 file uploaded in 401.807 sec. Processed 1 file, 905.80 MiB (0.00 file/s, 2.25 MiB/s)</code></pre><h2>数据导入和荡涤</h2><h3>创建表格</h3><p>在工作区中,执行以下 SQL 语句为数据集创建表格:</p><pre><code>CREATE TABLE user_behavior ( user_id INT NOT NULL, item_id INT NOT NULL, category_id INT NOT NULL, behavior_type VARCHAR, ts TIMESTAMP, day DATE );</code></pre><h3>荡涤、导入数据</h3><ol><li><p>执行以下 SQL 语句导入数据到表格中,并同时实现荡涤:</p><ul><li>去除有效的工夫区外的数据</li><li>数据去重</li><li>生成额定列数据</li></ul></li></ol><pre><code>INSERT INTO user_behaviorSELECT $1,$2,$3,$4,to_timestamp($5::bigint) AS ts, to_date(ts) dayFROM @mycsv/UserBehavior.csv.gz WHERE day BETWEEN ‘2017-11-25’ AND ‘2017-12-03’GROUP BY $1,$2,$3,$4,ts;</code></pre><ol start=“2”><li>执行以下 SQL 语句验证数据导入是否胜利。该语句将返回表格的 10 行数据。</li></ol><pre><code>SELECT * FROM user_behavior LIMIT 10;</code></pre><h2>数据分析</h2><p>在实现了后期的筹备和数据导入之后,咱们正式开始进行数据分析。</p><h3>用户流量及购物状况剖析</h3><h4>总访问量和用户数</h4><pre><code>SELECT SUM(CASE WHEN behavior_type = ‘pv’ THEN 1 ELSE 0 END) as pv,COUNT(DISTINCT user_id) as uvFROM user_behavior;</code></pre><p></p><h4>日均访问量和用户量</h4><pre><code>SELECT day, SUM(CASE WHEN behavior_type = ‘pv’ THEN 1 ELSE 0 END) AS pv, COUNT(DISTINCT user_id) AS uvFROM user_behaviorGROUP BY dayORDER BY day;</code></pre><p></p><p>也能够通过 应用仪表盘 性能,生成折线图:</p><p></p><h4>统计每个用户的购物状况,生成新表:user_behavior_count</h4><pre><code>create table user_behavior_count as select user_id, sum(case when behavior_type = ‘pv’ then 1 else 0 end) as pv, –点击数 sum(case when behavior_type = ‘fav’ then 1 else 0 end) as fav, –珍藏数 sum(case when behavior_type = ‘cart’ then 1 else 0 end) as cart, –加购物车数 sum(case when behavior_type = ‘buy’ then 1 else 0 end) as buy –购买数 from user_behaviorgroup by user_id;</code></pre><h4>复购率:两次或两次以上购买的用户占购买用户的比例</h4><pre><code>select sum(case when buy > 1 then 1 else 0 end) / sum(case when buy > 0 then 1 else 0 end)from user_behavior_count;</code></pre><p></p><h3>用户行为转换率</h3><h4>点击/(加购物车 + 珍藏)/购买,各环节转化率</h4><pre><code>select a.pv, a.fav, a.cart, a.fav + a.cart as fav+cart, a.buy, round((a.fav + a.cart) / a.pv, 4) as pv2favcart, round(a.buy / (a.fav + a.cart), 4) as favcart2buy, round(a.buy / a.pv, 4) as pv2buyfrom(select sum(pv) as pv, –点击数sum(fav) as fav, –珍藏数sum(cart) as cart, –加购物车数sum(buy) as buy –购买数from user_behavior_count) as a;</code></pre><p></p><h4>计算一个小时实现浏览->增加到购物->并领取的用户</h4><pre><code>SELECT count_if(level>=1) as pv, count_if(level>=2) as cart, count_if(level>=3) as buyFROM( SELECT user_id, window_funnel(3600000000)(ts, behavior_type = ‘pv’,behavior_type = ‘cart’,behavior_type = ‘buy’) AS level FROM user_behavior GROUP BY user_id);</code></pre><p></p><h3>用户行为习惯</h3><h4>每天用户购物行为</h4><pre><code>select to_hour(ts) as hour, sum(case when behavior_type = ‘pv’ then 1 else 0 end) as pv, –点击数 sum(case when behavior_type = ‘fav’ then 1 else 0 end) as fav, –珍藏数 sum(case when behavior_type = ‘cart’ then 1 else 0 end) as cart, –加购物车数 sum(case when behavior_type = ‘buy’ then 1 else 0 end) as buy –购买数from user_behaviorgroup by hourorder by hour;</code></pre><p></p><p>也能够通过 应用仪表盘 性能,生成折线图:</p><p></p><h4>每周用户购物行为</h4><pre><code>select to_day_of_week(day) as weekday,day, sum(case when behavior_type = ‘pv’ then 1 else 0 end) as pv, –点击数 sum(case when behavior_type = ‘fav’ then 1 else 0 end) as fav, –珍藏数 sum(case when behavior_type = ‘cart’ then 1 else 0 end) as cart, –加购物车数 sum(case when behavior_type = ‘buy’ then 1 else 0 end) as buy –购买数from user_behaviorwhere day between ‘2017-11-27’ and ‘2017-12-03’group by weekday,dayorder by weekday;</code></pre><p></p><p>也能够通过 应用仪表盘 性能,生成柱状图:</p><p></p><h3>基于 RFM 模型找出有价值用户</h3><p>RFM 模型是掂量客户价值和客户创利能力的重要工具和伎俩,其中由 3 个因素形成了数据分析最好的指标:</p><ul><li>R-Recency(最近一次购买工夫)</li><li>F-Frequency(生产频率)</li><li>M-Money(生产金额)</li></ul><h4>R-Recency(最近购买工夫):R值越高,用户越沉闷</h4><pre><code>select user_id, to_date(‘2017-12-04’) - max(day) as R, dense_rank() over(order by (to_date(‘2017-12-04’) - max(day))) as R_rankfrom user_behaviorwhere behavior_type = ‘buy’group by user_idlimit 10;</code></pre><p></p><h3>F-Frequency(生产频率):F值越高,用户越虔诚</h3><pre><code>select user_id, count(1) as F, dense_rank() over(order by count(1) desc) as F_rankfrom user_behaviorwhere behavior_type = ‘buy’group by user_idlimit 10;</code></pre><p></p><h3>用户分组</h3><p>对有购买行为的用户依照排名进行分组,共划分为 5 组:</p><ul><li>前 1/5 的用户打 5 分</li><li>前 1/5 - 2/5 的用户打 4 分</li><li>前 2/5 - 3/5 的用户打 3 分</li><li>前 3/5 - 4/5 的用户打 2 分</li><li>其余用户打 1 分</li></ul><p>依照这个规定别离对用户工夫距离排名打分和购买频率排名打分,最初把两个分数合并在一起作为该名用户的最终评分。</p><pre><code>with cte as(select user_id, to_date(‘2017-12-04’) - max(day) as R, dense_rank() over(order by (to_date(‘2017-12-04’) - max(day))) as R_rank, count(1) as F, dense_rank() over(order by count(1) desc) as F_rankfrom user_behaviorwhere behavior_type = ‘buy’group by user_id)select user_id, R, R_rank, R_score, F, F_rank, F_score, R_score + F_score AS scorefrom(select *, case ntile(5) over(order by R_rank) when 1 then 5 when 2 then 4 when 3 then 3 when 4 then 2 when 5 then 1 end as R_score, case ntile(5) over(order by F_rank) when 1 then 5 when 2 then 4 when 3 then 3 when 4 then 2 when 5 then 1 end as F_scorefrom cte) as aorder by score desclimit 20;</code></pre><p></p><h3>商品维度剖析</h3><h4>销量最高的商品</h4><pre><code>select item_id , sum(case when behavior_type = ‘pv’ then 1 else 0 end) as pv, –点击数 sum(case when behavior_type = ‘fav’ then 1 else 0 end) as fav, –珍藏数 sum(case when behavior_type = ‘cart’ then 1 else 0 end) as cart, –加购物车数 sum(case when behavior_type = ‘buy’ then 1 else 0 end) as buy –购买数from user_behaviorgroup by item_idorder by buy desclimit 10;</code></pre><p></p><h4>销量最高的商品类别</h4><pre><code>select category_id , sum(case when behavior_type = ‘pv’ then 1 else 0 end) as pv, –点击数 sum(case when behavior_type = ‘fav’ then 1 else 0 end) as fav, –珍藏数 sum(case when behavior_type = ‘cart’ then 1 else 0 end) as cart, –加购物车数 sum(case when behavior_type = ‘buy’ then 1 else 0 end) as buy –购买数from user_behaviorgroup by category_idorder by buy desclimit 10;</code></pre><p></p><h3>用户留存剖析</h3><p>开始前,创建表格"day_users"并插入数据:</p><pre><code>create table day_users(day date,users bitmap);insert into day_users select day, build_bitmap(list(user_id::UInt64)) from user_behavior group by day;</code></pre><h4>统计每天UV</h4><pre><code>select day,bitmap_count(users) from day_users order by day;</code></pre><p></p><h4>绝对留存</h4><p>这里计算绝对于 11 月 23 日,12 月 2 号还在应用淘宝用户:</p><pre><code>select bitmap_count(bitmap_and(a.users, b.users))from (select users from day_users where day=‘2017-11-25’) a ,(select users from day_users where day=‘2017-12-02’) b;</code></pre><p></p><h4>绝对新增</h4><pre><code>select bitmap_count(bitmap_not(b.users, a.users)) from (select users from day_users where day=‘2017-11-25’) a ,(select users from day_users where day=‘2017-12-02’) b;</code></pre><p></p></article>