Hive Configuration and Built-in Functions
Configuration
- Global: hive-site.xml
- Startup option: ./hive -hiveconf xxx
- Session setting (takes effect only for the current session): jdbc:hive2://node3:10000> set mapreduce.job.reduces=3;
Conditional Functions
- if(condition, value returned when true, value returned when false)
case
case a when b then c [when d then e]* [else f] end
- If a = b, return c; otherwise check whether a = d and return e; if nothing matches, return f
case when a then b [when c then d]* [else e] end
- If condition a holds, return b; if condition c holds, return d; if none hold, return e
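A quick sketch of both forms (using the emp and itcast_t1 sample tables defined later in these notes):

```sql
-- if: the condition 17 >= 18 is false, so the third argument is returned
select if(17 >= 18, 'adult', 'minor');

-- simple case: compares deptno against each listed value
select ename,
       case deptno when 10 then 'ACCOUNTING'
                   when 20 then 'RESEARCH'
                   else 'OTHER' end as dname
from emp;

-- searched case: evaluates each boolean condition in turn
select cookieid,
       case when pv > 5 then 'high'
            when pv > 2 then 'medium'
            else 'low' end as level
from itcast_t1;
```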
Type Conversion Functions
The cast function
- When Hive compares a SMALLINT with an INT, it implicitly widens the SMALLINT to INT, but the wider INT cannot be converted down to SMALLINT without an explicit cast: select cast(100 as SMALLINT);
- Any integer type can be implicitly converted to a wider type. TINYINT, SMALLINT, INT, BIGINT, FLOAT, and STRING can all be implicitly converted to DOUBLE; yes, you read that right, even STRING converts implicitly to DOUBLE! But remember: BOOLEAN cannot be converted to any other type!
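For instance (a sketch; the behavior follows the conversion rules above):

```sql
-- explicit narrowing requires cast
select cast(100 as SMALLINT);

-- STRING is implicitly promoted to DOUBLE in a numeric context
select '3.5' + 1;  -- evaluated as 3.5 + 1.0 = 4.5

-- widening happens implicitly: a TINYINT compares against an INT with no cast
select cast(1 as TINYINT) = 1;
```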
order by, sort by, distribute by, cluster by
- When there are multiple reducers (partitions), distribute by cannot be combined with order by; switch to sort by instead. If the sort is ascending on the same column being distributed on, the pair can be replaced directly with cluster by.
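A sketch against the emp table defined later in these notes:

```sql
set mapreduce.job.reduces=3;

-- rows are hashed to reducers by deptno, then sorted within each reducer
select * from emp distribute by deptno sort by deptno;

-- shorthand when distributing and sorting ascending on the same column
select * from emp cluster by deptno;
```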
Row and Column Conversion
Functions
- concat: concatenates strings of simple types
- concat_ws(separator, complex-type input)
collect_set | collect_list
- collect_set removes duplicates while collecting; collect_list does not
- Example
```sql
-- data (fields separated by '\t'):
-- 20 SMITH   30 ALLEN   30 WARD    20 JONES   30 MARTIN
-- 30 BLAKE   10 CLARK   20 SCOTT   10 KING    30 TURNER
-- 20 ADAMS   30 JAMES   20 FORD    10 MILLER

create table emp(deptno int, ename string)
row format delimited fields terminated by '\t';

select deptno, concat_ws("|", collect_set(ename)) from emp group by deptno;

-- the inner collect_set step by itself:
select collect_set(ename) from emp group by deptno;
+-----------------------------------------------------+--+
| ["CLARK","KING","MILLER"]                           |
| ["SMITH","JONES","SCOTT","ADAMS","FORD"]            |
| ["ALLEN","WARD","MARTIN","BLAKE","TURNER","JAMES"]  |
+-----------------------------------------------------+--+
```
Column-to-Row Conversion
- explode(complex type): a UDTF (table-generating function)
```sql
select deptno, explode(name) from emp2;  -- fails
```
Restrictions on using a UDTF:
1. When placed after select, the UDTF must be the only expression; no other columns are allowed.
2. It cannot be nested inside another function, though it may wrap other functions.
Solution: the lateral view.
`LATERAL VIEW udtf(expression) tableAlias AS columnAlias` (behaves like a temporary table)
```sql
select deptno, name from emp2 LATERAL VIEW explode(names) tmp as name;

0: jdbc:hive2://node3:10000> select * from emp2 LATERAL VIEW explode(name) tmp as name1;
+--------------+-----------------------------------------------------+------------+--+
| emp2.deptno  | emp2.name                                           | tmp.name1  |
+--------------+-----------------------------------------------------+------------+--+
| 10           | ["CLARK","KING","MILLER"]                           | CLARK      |
| 10           | ["CLARK","KING","MILLER"]                           | KING       |
| 10           | ["CLARK","KING","MILLER"]                           | MILLER     |
| 20           | ["SMITH","JONES","SCOTT","ADAMS","FORD"]            | SMITH      |
| 20           | ["SMITH","JONES","SCOTT","ADAMS","FORD"]            | JONES      |
| 20           | ["SMITH","JONES","SCOTT","ADAMS","FORD"]            | SCOTT      |
| 20           | ["SMITH","JONES","SCOTT","ADAMS","FORD"]            | ADAMS      |
| 20           | ["SMITH","JONES","SCOTT","ADAMS","FORD"]            | FORD       |
| 30           | ["ALLEN","WARD","MARTIN","BLAKE","TURNER","JAMES"]  | ALLEN      |
| 30           | ["ALLEN","WARD","MARTIN","BLAKE","TURNER","JAMES"]  | WARD       |
| 30           | ["ALLEN","WARD","MARTIN","BLAKE","TURNER","JAMES"]  | MARTIN     |
```
- Concatenating complex types
```sql
select deptno, concat_ws("|", collect_set(ename)) as c1 from emp group by deptno;
+---------+---------------------------------------+--+
| deptno  | c1                                    |
+---------+---------------------------------------+--+
| 10      | CLARK|KING|MILLER                     |
| 20      | SMITH|JONES|SCOTT|ADAMS|FORD          |
| 30      | ALLEN|WARD|MARTIN|BLAKE|TURNER|JAMES  |
+---------+---------------------------------------+--+
```
- Restoring the concatenated result back to the original rows
```sql
select explode(split(c1, "\\|"))
from (
  select deptno, concat_ws("|", collect_list(ename)) as c1
  from emp
  group by deptno
) as tmp;
```
The reflect function
- Supports calling built-in Java methods from SQL. Format: reflect('java.lang.Math','abs',-10)
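A few calls that illustrate the pattern (class name, method name, then the arguments):

```sql
select reflect('java.lang.Math', 'abs', -10);     -- 10
select reflect('java.lang.Math', 'max', 10, 20);  -- 20
select reflect('java.util.UUID', 'randomUUID');   -- a random UUID string
```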
Hive Analytic (Window) Functions
Analytic Functions, Part 1
- Format: analytic_function over(partition by xxx order by xxx [asc|desc])
- Placement: after select, before from
Analytic functions:
Top-N within each group
- row_number(): ignores ties; every row gets a distinct sequential number
- rank(): ties share the same rank, and the following rank skips ahead by the number of ties (+n)
- dense_rank(): ties share the same rank without leaving gaps; subsequent ranks continue normally
- ntile(n) over(order by xxx) ... where xx = 1: distributes rows into n buckets as evenly as possible; useful for fetching the top 1/n
```
+--------------+----------------+--------+--+
| t2.cookieid  | t2.createtime  | t2.pv  |
+--------------+----------------+--------+--+
| cookie1      | 2018-04-10     | 1      |
| cookie1      | 2018-04-11     | 5      |
| cookie1      | 2018-04-12     | 7      |
| cookie1      | 2018-04-13     | 3      |
| cookie1      | 2018-04-14     | 2      |
| cookie1      | 2018-04-15     | 4      |
| cookie1      | 2018-04-16     | 4      |
| cookie2      | 2018-04-10     | 2      |
| cookie2      | 2018-04-11     | 3      |
| cookie2      | 2018-04-12     | 5      |
| cookie2      | 2018-04-13     | 6      |
| cookie2      | 2018-04-14     | 3      |
| cookie2      | 2018-04-15     | 9      |
+--------------+----------------+--------+--+
```
```sql
select cookieid, rank() over(partition by cookieid order by pv desc) as rank2 from t2;
```
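To contrast the four ranking functions on the same t2 data, a sketch:

```sql
select cookieid, createtime, pv,
       row_number() over (partition by cookieid order by pv desc) as rn,
       rank()       over (partition by cookieid order by pv desc) as rk,
       dense_rank() over (partition by cookieid order by pv desc) as drk,
       ntile(3)     over (partition by cookieid order by pv desc) as nt
from t2;

-- top-3 rows per cookie via a subquery filter
select * from (
  select cookieid, createtime, pv,
         row_number() over (partition by cookieid order by pv desc) as rn
  from t2
) t where t.rn <= 3;
```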
Hive Analytic Functions, Part 2
```sql
-- create the table:
create table itcast_t1(
  cookieid string,
  createtime string,  -- day
  pv int
) row format delimited fields terminated by ',';

-- load the data:
load data local inpath '/root/hivedata/itcast_t1.dat' into table itcast_t1;
```
Data file (itcast_t1.dat):
```
cookie1,2018-04-10,1
cookie1,2018-04-11,5
cookie1,2018-04-12,7
cookie1,2018-04-13,3
cookie1,2018-04-14,2
cookie1,2018-04-15,4
cookie1,2018-04-16,4
```
- sum,avg,max,count
sum: running totals and related cumulative calculations
Use cases: the current row plus all preceding rows; the current row with the previous three rows; the previous three rows plus the next row; the current row plus all following rows
- sum(pv) over(partition by cookieid order by createtime rows between unbounded preceding and current row) as pv2
- sum(pv) over(partition by cookieid order by createtime rows between 3 preceding and current row) as pv3
- sum(pv) over(partition by cookieid order by createtime rows between 3 preceding and 1 following) as pv4
- sum(pv) over(partition by cookieid order by createtime rows between current row and unbounded following) as pv6
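The window frames above can be combined into a single query over itcast_t1 (a sketch):

```sql
select cookieid, createtime, pv,
       sum(pv) over (partition by cookieid order by createtime
                     rows between unbounded preceding and current row) as pv2,
       sum(pv) over (partition by cookieid order by createtime
                     rows between 3 preceding and current row) as pv3,
       sum(pv) over (partition by cookieid order by createtime
                     rows between 3 preceding and 1 following) as pv4,
       sum(pv) over (partition by cookieid order by createtime
                     rows between current row and unbounded following) as pv6
from itcast_t1;
```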
Analytic Functions, Part 3
LAG,LEAD,FIRST_VALUE,LAST_VALUE
- Use case: comparing values across neighboring rows (previous/next)
```
+--------------+----------------------+---------+--+
| t4.cookieid  | t4.createtime        | t4.url  |
+--------------+----------------------+---------+--+
| cookie1      | 2018-04-10 10:00:02  | url2    |
| cookie1      | 2018-04-10 10:00:00  | url1    |
| cookie1      | 2018-04-10 10:03:04  | 1url3   |
| cookie1      | 2018-04-10 10:50:05  | url6    |
| cookie1      | 2018-04-10 11:00:00  | url7    |
| cookie1      | 2018-04-10 10:10:00  | url4    |
| cookie1      | 2018-04-10 10:50:01  | url5    |
| cookie2      | 2018-04-10 10:00:02  | url22   |
| cookie2      | 2018-04-10 10:00:00  | url11   |
| cookie2      | 2018-04-10 10:03:04  | 1url33  |
+--------------+----------------------+---------+--+
```
For lag and lead, the second argument is how many rows to look back (lag) or ahead (lead); the third argument is the default value filled in when no such row exists.
```sql
SELECT cookieid, createtime, url,
       ROW_NUMBER() OVER(PARTITION BY cookieid ORDER BY createtime) AS rn,
       LAG(createtime,1,'1970-01-01 00:00:00') OVER(PARTITION BY cookieid ORDER BY createtime) AS last_1_time,
       LAG(createtime,2) OVER(PARTITION BY cookieid ORDER BY createtime) AS last_2_time
FROM t4;
+-----------+----------------------+---------+-----+----------------------+--+
| cookieid  | createtime           | url     | rn  | last_1_time          |
+-----------+----------------------+---------+-----+----------------------+--+
| cookie1   | 2018-04-10 10:00:00  | url1    | 1   | 1970-01-01 00:00:00  |
| cookie1   | 2018-04-10 10:00:02  | url2    | 2   | 2018-04-10 10:00:00  |
| cookie1   | 2018-04-10 10:03:04  | 1url3   | 3   | 2018-04-10 10:00:02  |
| cookie1   | 2018-04-10 10:10:00  | url4    | 4   | 2018-04-10 10:03:04  |
| cookie1   | 2018-04-10 10:50:01  | url5    | 5   | 2018-04-10 10:10:00  |
| cookie1   | 2018-04-10 10:50:05  | url6    | 6   | 2018-04-10 10:50:01  |
```
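The section title also lists LEAD, FIRST_VALUE, and LAST_VALUE; against the same t4 table they can be sketched as:

```sql
SELECT cookieid, createtime, url,
       -- next row's createtime; the default fills in when there is no following row
       LEAD(createtime, 1, '1970-01-01 00:00:00')
           OVER (PARTITION BY cookieid ORDER BY createtime) AS next_1_time,
       -- first url within the partition
       FIRST_VALUE(url) OVER (PARTITION BY cookieid ORDER BY createtime) AS first_url,
       -- note: LAST_VALUE's default frame ends at the current row, so the
       -- frame must be widened to get the true last value of the partition
       LAST_VALUE(url) OVER (PARTITION BY cookieid ORDER BY createtime
                             ROWS BETWEEN UNBOUNDED PRECEDING
                                      AND UNBOUNDED FOLLOWING) AS last_url
FROM t4;
```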