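On event tracking: reconstructing user paths for a business line

Background

Every consumer-facing (2C) company cares about the behavior trails of the active users in its own app. Log formats differ from company to company, and so do the approaches to processing them. This post records one way of handling the problem. The log data comes from wireless (mobile client) logs, and Hive is the main storage and compute engine.
PS: only paths through the main flow are handled.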

Prerequisites

Log format

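For simplicity, assume that every log line of the business line carries a uid identifying a unique user, together with the timestamp of the log. Each line also has a log_id field that identifies the line itself, and a ref_log_id field that holds the log_id of its parent log, that is, the log of the page the user came from (the key is spelled ref_lof_id in the raw sample lines). A log line can therefore be written in the following format:
2020-09-15 12:00:13.343 uid=xxx,log_id=log_id_xxx,ref_lof_id=ref_log_id_xxx,page=detail
2020-09-15 13:00:14.343 uid=yyy,log_id=log_id_yyy,ref_lof_id=ref_log_id_yyy,page=list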

Log formatting

For logs in a regular format like the one above, the first step is to parse every field out. Assuming the logs are already under some HDFS path, it is enough to create an external table on that path; the table uses the regex SerDe that Hive provides to match the fields. The statement is as follows:

drop table tmp.log_test_20200915;
CREATE EXTERNAL TABLE tmp.log_test_20200915(
  time string COMMENT 'time',
  uid string COMMENT 'uid',
  log_id string COMMENT 'log_id',
  ref_log_id string COMMENT 'ref_log_id',
  page string COMMENT 'page visited by the user')
COMMENT 'test table'
ROW FORMAT SERDE
  'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
  'input.regex'='^(.*) uid=(.*),log_id=(.*),ref_lof_id=(.*),page=(.*)$',
  'output.format.string'='%1$s %2$s %3$s %4$s %5$s')
LOCATION
  'viewfs://cluster/user/test/logtest';

Once the statement above has run, the logs are formatted: every field of each line is matched into its own column.
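To check the pattern outside Hive, the matching step can be reproduced in plain Java. This is only a sketch: the class name LogLineRegexSketch is hypothetical, the pattern is copied from the table's input.regex (including the ref_lof_id spelling of the raw key), and the remark about NULL columns reflects how RegexSerDe typically treats unmatched rows rather than something stated in this post.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LogLineRegexSketch {
    // Same pattern as the table's 'input.regex'; the raw key is spelled ref_lof_id in the sample lines.
    static final Pattern LINE = Pattern.compile(
            "^(.*) uid=(.*),log_id=(.*),ref_lof_id=(.*),page=(.*)$");

    /** Returns {time, uid, log_id, ref_log_id, page}, or null when the line does not match. */
    public static String[] parse(String line) {
        Matcher m = LINE.matcher(line);
        if (!m.matches()) {
            return null; // in Hive such a row would typically come back with NULL columns
        }
        String[] fields = new String[m.groupCount()];
        for (int i = 0; i < fields.length; i++) {
            fields[i] = m.group(i + 1);
        }
        return fields;
    }

    public static void main(String[] args) {
        String ok = "2020-09-15 12:00:13.343 uid=xxx,log_id=log_id_xxx,"
                + "ref_lof_id=ref_log_id_xxx,page=detail";
        // prints: 2020-09-15 12:00:13.343 | xxx | log_id_xxx | ref_log_id_xxx | detail
        System.out.println(String.join(" | ", parse(ok)));

        // A full-width comma (U+FF0C) in place of the ASCII comma before page= breaks the match.
        String bad = ok.replace(",page=", "\uFF0Cpage=");
        System.out.println(parse(bad) == null); // prints: true
    }
}
```

The second case is worth keeping in mind when preparing test data: the separators must be the exact ASCII characters that the pattern spells out.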

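User path processing

User data preprocessing

After formatting, the user data needs preprocessing so that each user's records are gathered together.
Many people who care about user data are not interested in the specific page, only in the page type. For example, detail is a detail page and list is a list page, and there may be several detail pages and several list pages; the same holds for home and order pages. So we do a small conversion here: detail pages are denoted by D, list pages by L, the home page by H, and so on. A page_type field is added to the table by hand to represent the page type.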

with process_table as (
    select
        uid,
        log_id,
        ref_log_id,
        page,
        case when page = 'detail' then 'D' when page = 'list' then 'L' end as page_type
    from
        tmp.log_test_20200915
),
aggregate_table as (
    select
        uid,
        coalesce(concat('{', concat_ws(',', ws_log_id, ws_ref_log_id, ws_page, ws_page_type), '}'), '') json_str
    from
    (
        select
            uid,
            log_id,
            ref_log_id,
            page,
            page_type,
            concat(concat_ws('\":\"', '\"log_id', coalesce(log_id, '')), '\"') ws_log_id,
            concat(concat_ws('\":\"', '\"ref_log_id', coalesce(ref_log_id, '')), '\"') ws_ref_log_id,
            concat(concat_ws('\":\"', '\"page', coalesce(page, '')), '\"') ws_page,
            concat(concat_ws('\":\"', '\"page_type', coalesce(page_type, '')), '\"') ws_page_type
        from
            process_table
    ) a
),
json_table as (
    select
        uid,
        concat_ws(',', collect_set(json_str)) json_str
    from
        aggregate_table
    group by
        uid
)
select
    uid,
    concat('[', json_str, ']') json_str
from
    json_table

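Running the SQL above gives one row per uid: the uid together with the collection of all of that user's records for the day, stored as a JSON array string in json_str.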
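For a concrete picture of what the query assembles, the same steps can be mirrored in plain Java. This is a sketch under assumptions: the class and method names are hypothetical, values are concatenated without JSON escaping exactly as the SQL does, and insertion order is kept here whereas collect_set gives no ordering guarantee.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class UserJsonAggregationSketch {
    /** Mirrors the CASE expression: only detail and list are mapped, anything else yields null. */
    public static String pageType(String page) {
        if ("detail".equals(page)) return "D";
        if ("list".equals(page)) return "L";
        return null;
    }

    /** Same role as coalesce(x, ''). */
    static String nz(String s) {
        return s == null ? "" : s;
    }

    /** One JSON object string per row, like aggregate_table. Values are not escaped, same as the SQL. */
    public static String rowJson(String logId, String refLogId, String page) {
        return "{\"log_id\":\"" + nz(logId)
                + "\",\"ref_log_id\":\"" + nz(refLogId)
                + "\",\"page\":\"" + nz(page)
                + "\",\"page_type\":\"" + nz(pageType(page)) + "\"}";
    }

    /** rows hold {uid, log_id, ref_log_id, page}; returns uid -> "[{...},{...}]". */
    public static Map<String, String> aggregate(List<String[]> rows) {
        // A set per uid removes duplicate objects, as collect_set does.
        Map<String, Set<String>> perUid = new LinkedHashMap<>();
        for (String[] r : rows) {
            perUid.computeIfAbsent(r[0], k -> new LinkedHashSet<>())
                    .add(rowJson(r[1], r[2], r[3]));
        }
        Map<String, String> out = new LinkedHashMap<>();
        for (Map.Entry<String, Set<String>> e : perUid.entrySet()) {
            out.put(e.getKey(), "[" + String.join(",", e.getValue()) + "]");
        }
        return out;
    }

    public static void main(String[] args) {
        List<String[]> rows = Arrays.asList(
                new String[]{"xxx", "log_id_xxx", "ref_log_id_xxx", "detail"},
                new String[]{"yyy", "log_id_yyy", "ref_log_id_yyy", "list"});
        aggregate(rows).forEach((uid, json) -> System.out.println(uid + "\t" + json));
    }
}
```

Because nothing is escaped, a field value containing a double quote would produce invalid JSON in both the SQL and this sketch; that is acceptable only as long as log_id and page values are plain identifiers.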

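Processing user paths with a UDF

The example above is not ideal: the data is easier to understand with two log lines from the same user.
Suppose the row we get is the following:

xxx     [{"log_id":"log_id_xxx_current","ref_log_id":"log_id_xxx","page":"detail","page_type":"D"},{"log_id":"log_id_xxx","ref_log_id":"ref_log_id_kkk","page":"list","page_type":"L"}]

A user's path is chained together through log_id, and a UDF is used to match it.
The idea is as follows:
A hash map maintains the parent relationship between pages, for example that the parent of the detail page is the list page. Following the log_id links, the path is matched recursively until a log line has no matching parent-page log.
The code is as follows: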

import org.apache.hadoop.hive.ql.exec.UDF;
import org.json.JSONArray;
import org.json.JSONObject;

import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/**
 * get user path in day
 *
 * @author test
 * created on 2020-01-08
 */
public class GetUserPathFunction extends UDF {

    // page -> pages accepted as its direct parent (exact page-name match)
    private static final Map<String, List<String>> pageMap = new HashMap<>();

    static {
        pageMap.put("list", Arrays.asList("hotcity"));
        pageMap.put("detail", Arrays.asList("list", "serch_list"));
        pageMap.put("preOrder", Arrays.asList("detail"));
        pageMap.put("createOrder", Arrays.asList("preOrder"));
    }

    /**
     * get user trace from super_log_core table
     *
     * @param paths    user's paths in one day
     * @param logId    user's log_id of this log
     * @param refLogId user's ref_log_id of this log
     * @param page     ticket page
     * @param pageType kind of page
     */
    public String evaluate(String paths, String logId, String refLogId, String page, String pageType) throws Exception {
        // the result array is created per call so that rows never share state
        JSONArray result = new JSONArray();
        innerFunction(paths, logId, refLogId, page, pageType, result);
        return result.length() == 0 ? null : result.toString();
    }

    public void innerFunction(String paths, String logId, String refLogId, String page, String pageType, JSONArray result) throws Exception {
        if (null == logId || null == paths || "NULL".equals(logId) || "".equals(logId) || "".equals(paths)) {
            return;
        }
        // record the current log as one step of the path
        JSONObject jsonObject = new JSONObject();
        jsonObject.put("log_id", logId);
        jsonObject.put("ref_log_id", refLogId);
        jsonObject.put("page", page);
        jsonObject.put("page_type", pageType);
        result.put(jsonObject);
        if (null == refLogId || "".equals(refLogId)) {
            return;
        }
        JSONArray jsonArray;
        try {
            jsonArray = new JSONArray(paths);
        } catch (Exception e) {
            throw new Exception("paths is not a valid JSON array", e);
        }
        for (int i = 0; i < jsonArray.length(); i++) {
            JSONObject innerJsonObject = jsonArray.getJSONObject(i);
            // follow ref_log_id to the parent log, accepting it only if its page is an allowed parent
            if (innerJsonObject.getString("log_id").equals(refLogId)
                    && pageMap.containsKey(page)
                    && pageMap.get(page).contains(innerJsonObject.getString("page"))) {
                innerFunction(paths,
                        innerJsonObject.getString("log_id"),
                        innerJsonObject.getString("ref_log_id"),
                        innerJsonObject.getString("page"),
                        innerJsonObject.getString("page_type"),
                        result);
                break;
            }
        }
    }
}

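Calling the UDF in Hive returns a path whose structure is similar to the input we passed in, except that only the entries that can be chained together into the user's path are kept.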
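To try the chaining logic without a Hive or org.json dependency, the walk can be sketched in plain Java. Everything here is illustrative: the class UserPathWalkSketch and its Entry type are hypothetical, the parent table copies the page relationships from the UDF, parent pages are compared by exact name, log_id is assumed to be unique within a user's day, and the recursion is replaced by a loop.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class UserPathWalkSketch {
    /** One log entry; mirrors the JSON fields used in the post. */
    public static final class Entry {
        public final String logId, refLogId, page, pageType;

        public Entry(String logId, String refLogId, String page, String pageType) {
            this.logId = logId;
            this.refLogId = refLogId;
            this.page = page;
            this.pageType = pageType;
        }
    }

    // page -> pages accepted as its direct parent (same relationships as the UDF's map)
    static final Map<String, List<String>> PARENTS = new HashMap<>();

    static {
        PARENTS.put("list", Arrays.asList("hotcity"));
        PARENTS.put("detail", Arrays.asList("list", "serch_list"));
        PARENTS.put("preOrder", Arrays.asList("detail"));
        PARENTS.put("createOrder", Arrays.asList("preOrder"));
    }

    /** Returns the chain starting at current and following ref_log_id upwards. */
    public static List<Entry> walk(List<Entry> dayLogs, Entry current) {
        // Index the day's logs by log_id (assumed unique) so each hop is one lookup.
        Map<String, Entry> byLogId = new HashMap<>();
        for (Entry e : dayLogs) {
            byLogId.putIfAbsent(e.logId, e);
        }
        List<Entry> path = new ArrayList<>();
        Entry cur = current;
        // The size bound is a defensive guard against malformed, cyclic references.
        while (cur != null && path.size() <= dayLogs.size()) {
            path.add(cur);
            Entry parent = cur.refLogId == null ? null : byLogId.get(cur.refLogId);
            List<String> allowed = PARENTS.get(cur.page);
            if (parent == null || allowed == null || !allowed.contains(parent.page)) {
                break; // no parent log, or the parent page is not an allowed predecessor
            }
            cur = parent;
        }
        return path;
    }

    public static void main(String[] args) {
        Entry detail = new Entry("log_id_xxx_current", "log_id_xxx", "detail", "D");
        Entry list = new Entry("log_id_xxx", "ref_log_id_kkk", "list", "L");
        for (Entry e : walk(Arrays.asList(detail, list), detail)) {
            System.out.println(e.logId + " " + e.page);
        }
    }
}
```

With the two-entry example row, the walk starts at the detail log, hops to the list log through ref_log_id, and stops there because no log with id ref_log_id_kkk exists for that user.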

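Follow-up processing

Once we have the path, we split it further to get a finer-grained table that is more convenient to use. Its structure is roughly as follows: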

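Field	Type	Meaning
uid	string	uid
log_id	string	log_id of the current log line
ref_log_id	string	log_id of this line's parent log
page	string	page id of this log line
page_type	string	page type, one of H, L, D, B, O
h_log_id	string	log_id of the home page in this line's source path; null if the source path has no home page
h_page	string	page id of the home page in this line's source path; null if the source path has no home page
l_log_id	string	log_id of the List page in this line's source path; null if the source path has no List page
l_page	string	page id of the List page in this line's source path; null if the source path has no List page
d_log_id	string	log_id of the D page in this line's source path; null if the source path has no D page
d_page	string	page id of the D page in this line's source path; null if the source path has no D page
b_log_id	string	log_id of the Booking page in this line's source path; null if the source path has no Booking page
b_page	string	page id of the Booking page in this line's source path; null if the source path has no Booking page
o_log_id	string	log_id of the Order page in this line's source path; null if the source path has no Order page
o_page	string	page id of the Order page in this line's source path; null if the source path has no Order page
path	string	user path, a string holding an array of JSON objects with the fields log_id, ref_log_id, page, page_type; the array includes the JSON of the current page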
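The split from a path into the per-type columns of this table could look like the following Java sketch. The class name PathFlattenSketch and the input shape are hypothetical. Two choices are assumptions, since the post does not specify them: the current page (the first path entry) also fills the columns of its own type, and when a type occurs more than once the entry nearest to the current page wins.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

public class PathFlattenSketch {
    // Page types from the table: Home, List, Detail, Booking, Order.
    static final String TYPES = "HLDBO";

    /**
     * path holds entries of {log_id, page, page_type}, current page first.
     * Returns column name -> value; types missing from the path stay null.
     */
    public static Map<String, String> flatten(List<String[]> path) {
        Map<String, String> cols = new LinkedHashMap<>();
        for (char t : TYPES.toCharArray()) {
            String prefix = String.valueOf(Character.toLowerCase(t));
            cols.put(prefix + "_log_id", null);
            cols.put(prefix + "_page", null);
        }
        for (String[] e : path) {
            String type = e[2];
            if (type == null || type.length() != 1 || TYPES.indexOf(type) < 0) {
                continue; // unknown or empty page type: nothing to fill
            }
            String prefix = type.toLowerCase(Locale.ROOT);
            if (cols.get(prefix + "_log_id") == null) { // keep the entry nearest to the current page
                cols.put(prefix + "_log_id", e[0]);
                cols.put(prefix + "_page", e[1]);
            }
        }
        return cols;
    }

    public static void main(String[] args) {
        List<String[]> path = Arrays.asList(
                new String[]{"log_id_xxx_current", "detail", "D"},
                new String[]{"log_id_xxx", "list", "L"});
        flatten(path).forEach((k, v) -> System.out.println(k + "\t" + v));
    }
}
```

For the two-entry example path this fills d_log_id, d_page, l_log_id and l_page, and leaves the h, b and o columns null.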