Hive Configuration and Built-in Functions

Configuration

  • Global: hive-site.xml
  • Command line: ./hive -hiveconf xxx
  • Session setting (takes effect only for the current session): jdbc:hive2://node3:10000> set mapreduce.job.reduces=3;

Conditional functions

  • if(condition, value returned if true, value returned if false)
  • case

    • case a when b then c [when d then e]* [else f] end

      • If a = b, return c; otherwise check whether a = d and return e; if nothing matches, return f
    • case when a then b [when c then d]* [else e] end

      • If condition a holds, return b; if condition c holds, return d; if none holds, return e
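Both CASE forms follow standard SQL semantics, so they can be sketched with any SQL engine. A minimal demonstration using Python's built-in sqlite3 module as a stand-in (the table and department names here are illustrative, and Hive's if() has no sqlite equivalent, so only CASE is shown):

```python
import sqlite3

# Stand-in demo: sqlite3 ships with Python and implements the same
# standard SQL CASE semantics that HiveQL uses.
conn = sqlite3.connect(":memory:")
conn.execute("create table emp(deptno int, ename text)")
conn.executemany("insert into emp values (?, ?)",
                 [(10, "CLARK"), (20, "SMITH"), (30, "ALLEN")])

# Simple CASE:  case a when b then c [when d then e]* [else f] end
simple = conn.execute("""
    select ename,
           case deptno when 10 then 'ACCOUNTING'
                       when 20 then 'RESEARCH'
                       else 'SALES' end
    from emp order by deptno
""").fetchall()
print(simple)  # [('CLARK', 'ACCOUNTING'), ('SMITH', 'RESEARCH'), ('ALLEN', 'SALES')]

# Searched CASE:  case when a then b [when c then d]* [else e] end
searched = conn.execute("""
    select ename,
           case when deptno < 25 then 'low' else 'high' end
    from emp order by deptno
""").fetchall()
print(searched)  # [('CLARK', 'low'), ('SMITH', 'low'), ('ALLEN', 'high')]
```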

Conversion functions

  • The cast function

    • When Hive compares a SMALLINT with an INT, the SMALLINT is implicitly converted to INT, but converting a wider type such as INT down to SMALLINT requires an explicit cast: select cast(100 as SMALLINT);
    • Any integer type can be implicitly converted to a wider type. TINYINT, SMALLINT, INT, BIGINT, FLOAT, and STRING can all be implicitly converted to DOUBLE; yes, you read that right, even STRING converts implicitly to DOUBLE! But remember: BOOLEAN cannot be converted to any other type!
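Explicit casting can be sketched outside Hive as well; a small demonstration with sqlite3 as a stand-in (sqlite has no SMALLINT width, so INTEGER is used, and the rules shown are sqlite's, which only mirror Hive's in these simple cases):

```python
import sqlite3

# Stand-in demo, not Hive itself: explicit CAST retypes a value.
conn = sqlite3.connect(":memory:")

# String -> number, the direction Hive also performs implicitly toward DOUBLE
as_int = conn.execute("select cast('100' as integer)").fetchone()[0]
print(as_int)    # 100

# Narrowing a fractional value truncates toward zero
narrowed = conn.execute("select cast(3.9 as integer)").fetchone()[0]
print(narrowed)  # 3
```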

order by,sort by,distribute by,cluster by

  • When data is spread across multiple reducers, order by cannot be combined with distribute by; use sort by instead. If the sort is ascending on the distribute-by column, the pair can be replaced directly by cluster by.

Row/column conversion

  • Functions

    • concat — concatenates strings of simple (scalar) types
    • concat_ws(separator, complex-type input)
    • collect_set | collect_list

      • The set variant drops duplicates while collecting; the list variant keeps them
  • Example

# data, tab-separated:
20      SMITH
30      ALLEN
30      WARD
20      JONES
30      MARTIN
30      BLAKE
10      CLARK
20      SCOTT
10      KING
30      TURNER
20      ADAMS
30      JAMES
20      FORD
10      MILLER

create table emp(deptno int, ename string)
row format delimited fields terminated by '\t';

select deptno, concat_ws("|", collect_set(ename)) from emp group by deptno;

# where the inner aggregation alone gives:
select collect_set(ename) from emp group by deptno;
+-----------------------------------------------------+--+
| ["CLARK","KING","MILLER"]                           |
| ["SMITH","JONES","SCOTT","ADAMS","FORD"]            |
| ["ALLEN","WARD","MARTIN","BLAKE","TURNER","JAMES"]  |
+-----------------------------------------------------+--+
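The same many-rows-to-one aggregation can be reproduced outside Hive; in this sketch sqlite3's group_concat plays the role of concat_ws("|", collect_list(ename)). sqlite has no collect_set, so duplicates would be kept, though this sample data contains none:

```python
import sqlite3

rows = [(20, "SMITH"), (30, "ALLEN"), (30, "WARD"), (20, "JONES"),
        (30, "MARTIN"), (30, "BLAKE"), (10, "CLARK"), (20, "SCOTT"),
        (10, "KING"), (30, "TURNER"), (20, "ADAMS"), (30, "JAMES"),
        (20, "FORD"), (10, "MILLER")]

conn = sqlite3.connect(":memory:")
conn.execute("create table emp(deptno int, ename text)")
conn.executemany("insert into emp values (?, ?)", rows)

# group_concat(col, sep) ~ concat_ws(sep, collect_list(col)); element
# order is not guaranteed by sqlite, so we sort after splitting.
grouped = {deptno: sorted(names.split("|")) for deptno, names in conn.execute(
    "select deptno, group_concat(ename, '|') from emp group by deptno")}
print(grouped[10])  # ['CLARK', 'KING', 'MILLER']
```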
  • Column to rows

    • explode(complex type) — a UDTF (table-generating function)
select deptno, explode(name) from emp2;

Restrictions:
1. When explode appears in the select list, it must be the only expression; no other columns are allowed.
2. It cannot be nested inside other expressions, but it can wrap other functions (e.g. explode(split(...))).

Solution: the lateral view.
LATERAL VIEW udtf(expression) tableAlias AS columnAlias   (behaves like a temporary table)

select deptno, name from emp2 LATERAL VIEW explode(names) tmp as name;

0: jdbc:hive2://node3:10000> select * from emp2 LATERAL VIEW explode(name) tmp as name1;
+--------------+-----------------------------------------------------+------------+--+
| emp2.deptno  |                      emp2.name                      | tmp.name1  |
+--------------+-----------------------------------------------------+------------+--+
| 10           | ["CLARK","KING","MILLER"]                           | CLARK      |
| 10           | ["CLARK","KING","MILLER"]                           | KING       |
| 10           | ["CLARK","KING","MILLER"]                           | MILLER     |
| 20           | ["SMITH","JONES","SCOTT","ADAMS","FORD"]            | SMITH      |
| 20           | ["SMITH","JONES","SCOTT","ADAMS","FORD"]            | JONES      |
| 20           | ["SMITH","JONES","SCOTT","ADAMS","FORD"]            | SCOTT      |
| 20           | ["SMITH","JONES","SCOTT","ADAMS","FORD"]            | ADAMS      |
| 20           | ["SMITH","JONES","SCOTT","ADAMS","FORD"]            | FORD       |
| 30           | ["ALLEN","WARD","MARTIN","BLAKE","TURNER","JAMES"]  | ALLEN      |
| 30           | ["ALLEN","WARD","MARTIN","BLAKE","TURNER","JAMES"]  | WARD       |
| 30           | ["ALLEN","WARD","MARTIN","BLAKE","TURNER","JAMES"]  | MARTIN     |
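The explode + LATERAL VIEW pattern has close analogues in other engines. As a hedged sketch, sqlite3's json_each table-valued function unnests a JSON array per row much like LATERAL VIEW explode(names) tmp AS name (this assumes a sqlite build with the JSON1 functions, standard since 3.38; the table layout is illustrative):

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table emp2(deptno int, names text)")  # names stored as a JSON array
conn.executemany("insert into emp2 values (?, ?)", [
    (10, json.dumps(["CLARK", "KING", "MILLER"])),
    (20, json.dumps(["SMITH", "JONES", "SCOTT", "ADAMS", "FORD"])),
])

# json_each(emp2.names) produces one row per array element, correlated
# with its parent row -- the same shape as a Hive lateral view.
exploded = conn.execute("""
    select emp2.deptno, j.value
    from emp2, json_each(emp2.names) as j
""").fetchall()
print(exploded[:3])  # [(10, 'CLARK'), (10, 'KING'), (10, 'MILLER')]
```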
  • Concatenating a complex type
select deptno, concat_ws("|", collect_set(ename)) as c1 from emp group by deptno;
+---------+---------------------------------------+--+
| deptno  |                  c1                   |
+---------+---------------------------------------+--+
| 10      | CLARK|KING|MILLER                     |
| 20      | SMITH|JONES|SCOTT|ADAMS|FORD          |
| 30      | ALLEN|WARD|MARTIN|BLAKE|TURNER|JAMES  |
+---------+---------------------------------------+--+
  • Restoring the above back to the original rows
select explode(split(c1, "\\|"))
from (select deptno, concat_ws("|", collect_list(ename)) as c1
      from emp group by deptno) as tmp;

The reflect function

- Supports calling built-in Java methods from SQL
- Format: reflect('java.lang.Math', 'abs', -10)
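reflect works by looking up a static Java method by class and name at runtime and invoking it. A hypothetical Python analogue of that idea, using importlib and getattr (the helper name and the math examples are illustrative, not part of Hive):

```python
import importlib

def reflect(module_path: str, func_name: str, *args):
    """Resolve a function by module and name at call time, then invoke it,
    mimicking Hive's reflect('java.lang.Math', 'abs', -10)."""
    module = importlib.import_module(module_path)
    return getattr(module, func_name)(*args)

print(reflect("math", "fabs", -10))   # 10.0
print(reflect("math", "pow", 2, 10))  # 1024.0
```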

Hive analytic (window) functions

Analytic functions, part 1

  • Format: analytic_function over(partition by xxx order by xxx [asc|desc])
  • Placement: after select, before from
  • Analytic functions:

    • Top-N per group

      • row_number() — ignores ties; every row gets a distinct sequence number
      • rank() — ties share a rank, and the next rank skips ahead by the number of tied rows
      • dense_rank() — ties share a rank without leaving gaps; numbering continues normally
      • ntile(n) over(order by xxx) with a filter such as where xx = 1 — splits the rows into n buckets as evenly as possible, handy for fetching the top 1/n
+--------------+----------------+--------+--+
| t2.cookieid  | t2.createtime  | t2.pv  |
+--------------+----------------+--------+--+
| cookie1      | 2018-04-10     | 1      |
| cookie1      | 2018-04-11     | 5      |
| cookie1      | 2018-04-12     | 7      |
| cookie1      | 2018-04-13     | 3      |
| cookie1      | 2018-04-14     | 2      |
| cookie1      | 2018-04-15     | 4      |
| cookie1      | 2018-04-16     | 4      |
| cookie2      | 2018-04-10     | 2      |
| cookie2      | 2018-04-11     | 3      |
| cookie2      | 2018-04-12     | 5      |
| cookie2      | 2018-04-13     | 6      |
| cookie2      | 2018-04-14     | 3      |
| cookie2      | 2018-04-15     | 9      |
+--------------+----------------+--------+--+

select cookieid, rank() over(partition by cookieid order by pv desc) as rank2 from t2;
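The difference between the three ranking functions can be seen in any engine with standard window functions; a sketch with sqlite3 as a stand-in, using a deliberately constructed tie on pv = 4:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table t2(cookieid text, createtime text, pv int)")
conn.executemany("insert into t2 values (?, ?, ?)", [
    ("cookie1", "2018-04-10", 1), ("cookie1", "2018-04-11", 5),
    ("cookie1", "2018-04-12", 7), ("cookie1", "2018-04-15", 4),
    ("cookie1", "2018-04-16", 4),   # tie on pv = 4
])

ranked = conn.execute("""
    select pv,
           row_number() over w,
           rank()       over w,
           dense_rank() over w
    from t2
    window w as (partition by cookieid order by pv desc)
    order by 2
""").fetchall()
for r in ranked:
    print(r)
# (7, 1, 1, 1)
# (5, 2, 2, 2)
# (4, 3, 3, 3)
# (4, 4, 3, 3)   <- rank repeats then skips; dense_rank repeats without a gap
# (1, 5, 5, 4)
```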

Hive analytic functions, part 2

-- DDL:
create table itcast_t1(
  cookieid string,
  createtime string,   -- day
  pv int
) row format delimited fields terminated by ',';

-- load the data:
load data local inpath '/root/hivedata/itcast_t1.dat' into table itcast_t1;

cookie1,2018-04-10,1
cookie1,2018-04-11,5
cookie1,2018-04-12,7
cookie1,2018-04-13,3
cookie1,2018-04-14,2
cookie1,2018-04-15,4
cookie1,2018-04-16,4
  • sum,avg,max,count
  • sum — running (cascading) aggregation problems

    • Use cases: the current row with all preceding rows; the current row with the previous three rows; the previous three rows with the next row; the current row with all following rows

      • sum(pv) over(partition by cookieid order by createtime rows between unbounded preceding and current row) as pv2
      • sum(pv) over(partition by cookieid order by createtime rows between 3 preceding and current row) as pv3
      • sum(pv) over(partition by cookieid order by createtime rows between 3 preceding and 1 following) as pv4
      • sum(pv) over(partition by cookieid order by createtime rows between current row and unbounded following) as pv6
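These frame clauses behave the same in any standard SQL engine; a runnable sketch of the first two cases with sqlite3 as a stand-in, on a subset of the sample data:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table t1(cookieid text, createtime text, pv int)")
conn.executemany("insert into t1 values (?, ?, ?)", [
    ("cookie1", "2018-04-10", 1), ("cookie1", "2018-04-11", 5),
    ("cookie1", "2018-04-12", 7), ("cookie1", "2018-04-13", 3),
    ("cookie1", "2018-04-14", 2),
])

# pv2: running total; pv3: current row plus the previous three rows.
framed = conn.execute("""
    select createtime, pv,
           sum(pv) over (partition by cookieid order by createtime
                         rows between unbounded preceding and current row) as pv2,
           sum(pv) over (partition by cookieid order by createtime
                         rows between 3 preceding and current row) as pv3
    from t1 order by createtime
""").fetchall()
for r in framed:
    print(r)
# ('2018-04-10', 1, 1, 1)
# ('2018-04-11', 5, 6, 6)
# ('2018-04-12', 7, 13, 13)
# ('2018-04-13', 3, 16, 16)
# ('2018-04-14', 2, 18, 17)   <- pv3 window slid past the first row
```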

Analytic functions, part 3

  • LAG,LEAD,FIRST_VALUE,LAST_VALUE

    • Use case: comparing a row with the rows just before or after it
+--------------+----------------------+---------+--+
| t4.cookieid  |    t4.createtime     | t4.url  |
+--------------+----------------------+---------+--+
| cookie1      | 2018-04-10 10:00:02  | url2    |
| cookie1      | 2018-04-10 10:00:00  | url1    |
| cookie1      | 2018-04-10 10:03:04  | 1url3   |
| cookie1      | 2018-04-10 10:50:05  | url6    |
| cookie1      | 2018-04-10 11:00:00  | url7    |
| cookie1      | 2018-04-10 10:10:00  | url4    |
| cookie1      | 2018-04-10 10:50:01  | url5    |
| cookie2      | 2018-04-10 10:00:02  | url22   |
| cookie2      | 2018-04-10 10:00:00  | url11   |
| cookie2      | 2018-04-10 10:03:04  | 1url33  |

-- For lag/lead the second argument is the offset (how many rows back or
-- forward to look); the third is the default used when the result is NULL.
SELECT cookieid,
  createtime,
  url,
  ROW_NUMBER() OVER(PARTITION BY cookieid ORDER BY createtime) AS rn,
  LAG(createtime,1,'1970-01-01 00:00:00') OVER(PARTITION BY cookieid ORDER BY createtime) AS last_1_time,
  LAG(createtime,2) OVER(PARTITION BY cookieid ORDER BY createtime) AS last_2_time
FROM t4;

+-----------+----------------------+---------+-----+----------------------+--+
| cookieid  |      createtime      |   url   | rn  |     last_1_time      |
+-----------+----------------------+---------+-----+----------------------+--+
| cookie1   | 2018-04-10 10:00:00  | url1    | 1   | 1970-01-01 00:00:00  |
| cookie1   | 2018-04-10 10:00:02  | url2    | 2   | 2018-04-10 10:00:00  |
| cookie1   | 2018-04-10 10:03:04  | 1url3   | 3   | 2018-04-10 10:00:02  |
| cookie1   | 2018-04-10 10:10:00  | url4    | 4   | 2018-04-10 10:03:04  |
| cookie1   | 2018-04-10 10:50:01  | url5    | 5   | 2018-04-10 10:10:00  |
| cookie1   | 2018-04-10 10:50:05  | url6    | 6   | 2018-04-10 10:50:01  |
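The lag/lead behaviour above, including the default value for the first row, can be reproduced with sqlite3 as a stand-in on a small slice of the same data:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table t4(cookieid text, createtime text, url text)")
conn.executemany("insert into t4 values (?, ?, ?)", [
    ("cookie1", "2018-04-10 10:00:00", "url1"),
    ("cookie1", "2018-04-10 10:00:02", "url2"),
    ("cookie1", "2018-04-10 10:03:04", "1url3"),
])

# lag(expr, offset, default): previous row's value, or the default on the
# first row; lead looks forward and yields NULL past the last row.
navigated = conn.execute("""
    select url,
           lag(createtime, 1, '1970-01-01 00:00:00')
               over (partition by cookieid order by createtime) as prev_time,
           lead(url, 1) over (partition by cookieid order by createtime) as next_url
    from t4 order by createtime
""").fetchall()
for r in navigated:
    print(r)
# ('url1', '1970-01-01 00:00:00', 'url2')
# ('url2', '2018-04-10 10:00:00', '1url3')
# ('1url3', '2018-04-10 10:00:02', None)
```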