Python UDFs in Hive

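While building the ** order table, I found that the audience-group attributes of the order's product section live in the innermost JSON. The audience-group identifiers are not fixed values, which means the keys of the JSON are not fixed, and each record may contain several keys. get_json_object cannot handle this case.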

{"1":{"Price":{"name":"CNY","amount":169.0},"settlePrice":{"name":"CNY","amount":147.0}},"2":{"Price":{"name":"CNY","amount":92.6},"settlePrice":{"name":"CNY","amount":80.0}}}

For the JSON above, the desired result after processing is:

1 Price 169 settlePrice 147
2 Price 92.6 settlePrice 80

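Searching online for similar problems, most answers suggest writing a UDF in Java: first get the keys of the JSON, then use each key to fetch the corresponding value. In theory this works. However, our department's UDF project has a lengthy permission and release process, so a single UDF can easily take half a day from development to deployment, and development efficiency suffers. A script-based UDF would be much faster.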
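The "read the keys first, then look up each key" idea can be sketched in a few lines of Python. This is only an illustration on the sample JSON above: the function name `flatten_prices` is made up for the example, and the key names `Price` and `settlePrice` are taken from the sample rather than from the real table.

```python
import json

SAMPLE = ('{"1":{"Price":{"name":"CNY","amount":169.0},'
          '"settlePrice":{"name":"CNY","amount":147.0}},'
          '"2":{"Price":{"name":"CNY","amount":92.6},'
          '"settlePrice":{"name":"CNY","amount":80.0}}}')


def flatten_prices(raw_json):
    """Return one (key, price, settle_price) tuple per top-level key."""
    data = json.loads(raw_json)
    rows = []
    # The top-level keys are not known in advance, so iterate over them.
    # Sorting keeps the output order stable.
    for key in sorted(data):
        unit = data[key]
        rows.append((key, unit["Price"]["amount"], unit["settlePrice"]["amount"]))
    return rows


for row in flatten_prices(SAMPLE):
    # Tab-separated, one line per key.
    print("\t".join(str(value) for value in row))
```

Because the keys are discovered at run time, the same function works no matter how many audience groups a record contains.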

Assume the upstream table already provides the fields json, order_id, sku_id, from_date, and to_date. The next step is to parse out the information inside the JSON. The Python UDF is as follows:

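# coding=utf-8
# __author__ = 'zongrun.yang'
import json
import sys

# Python 2 only: make utf8 the default encoding so non-ASCII text does not
# raise errors when str and unicode values are concatenated and printed.
if sys.version_info[0] == 2:
    reload(sys)
    sys.setdefaultencoding('utf8')

# Keys inside each per-audience object. The sample JSON above shows the
# first key as "Price"; use whatever name the real data carries.
AR_PRICE = 'ARPrice'
SETTLE_PRICE = 'settlePrice'
AMOUNT = 'amount'

for line in sys.stdin:
    # Hive passes the selected columns as one tab-separated line.
    band_prices_json, order_id, sku_id, from_date, to_date = line.strip('\n').split('\t')

    band_prices = json.loads(band_prices_json)
    # The top-level keys are the audience identifiers, which are not fixed.
    for key in band_prices:
        unit = band_prices[key]
        ar_price = unit[AR_PRICE][AMOUNT]
        settle_price = unit[SETTLE_PRICE][AMOUNT]
        # One output line per key; column order must match the AS (...) clause.
        print('\t'.join([order_id, key, str(ar_price), str(settle_price),
                         sku_id, from_date, to_date]))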

Load the file above in Hive:

add file ./test.py;

Use it in Hive SQL:

select
    transform(band_prices,order_id,sku_id,from_date,to_date) USING 'python test.py' AS (order_id,people,AR_price,settle_price,sku_id,from_date,to_date)
from new_ploy_unit;

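1. When a Python UDF is used in Hive, every field needed downstream must be passed through the UDF. Mixing a transform call with ordinary columns in the same select is not allowed. For example, the following query is not allowed: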

select
    field1,
    transform(field2) using 'python test.py' as (field3,field4)
from table

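2. Hive separates the fields it passes to the script with \t by default, so the Python code must split each line on \t.

3. To turn one input row into multiple rows, separate them with \n in the Python output.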
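The three notes above can be checked locally before the script is added to Hive. The sketch below is an illustration under assumptions, not part of the original script: `transform_lines` is a hypothetical helper, the column layout (one JSON column followed by pass-through columns) mirrors the example query, and the input is simulated with a plain list instead of `sys.stdin`.

```python
import json


def transform_lines(lines):
    """Mimic a TRANSFORM script: tab-separated rows in, tab-separated rows out."""
    out = []
    for line in lines:
        # Note 2: the input columns arrive separated by tabs.
        raw_json, passthrough = line.rstrip("\n").split("\t", 1)
        # Note 1: columns needed downstream go through the script
        # and are echoed back on every output row.
        for key in sorted(json.loads(raw_json)):
            # Note 3: each emitted line becomes one output row.
            out.append(key + "\t" + passthrough)
    return out


fake_input = ['{"1": {}, "2": {}}\tfield1_value\n']
print("\n".join(transform_lines(fake_input)))
```

Here one input row with two JSON keys produces two output rows, and the pass-through value is repeated on each of them.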
