Hive Configuration and Built-in Functions

Configuration

  • Global: hive-site.xml
  • Command line: ./hive -hiveconf xxx
  • Session setting (takes effect only for the current session): jdbc:hive2://node3:10000> set mapreduce.job.reduces=3;

Conditional functions

  • if(condition, value returned if true, value returned if false)
  • case

    • case a when b then c [when d then e]* [else f] end

      • If a = b, return c; otherwise check whether a = d and return e; if nothing matches, return f
    • case when a then b [when c then d]* [else e] end

      • If condition a holds, return b; if condition c holds, return d; if none holds, return e
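Both CASE forms follow standard SQL semantics, so they can be sketched with any SQL engine. A minimal demonstration using Python's built-in sqlite3 module as a stand-in (the table and department names here are illustrative, and Hive's if() has no sqlite equivalent, so only CASE is shown):

```python
import sqlite3

# Stand-in demo: sqlite3 ships with Python and implements the same
# standard SQL CASE semantics that HiveQL uses.
conn = sqlite3.connect(":memory:")
conn.execute("create table emp(deptno int, ename text)")
conn.executemany("insert into emp values (?, ?)",
                 [(10, "CLARK"), (20, "SMITH"), (30, "ALLEN")])

# Simple CASE:  case a when b then c [when d then e]* [else f] end
simple = conn.execute("""
    select ename,
           case deptno when 10 then 'ACCOUNTING'
                       when 20 then 'RESEARCH'
                       else 'SALES' end
    from emp order by deptno
""").fetchall()
print(simple)  # [('CLARK', 'ACCOUNTING'), ('SMITH', 'RESEARCH'), ('ALLEN', 'SALES')]

# Searched CASE:  case when a then b [when c then d]* [else e] end
searched = conn.execute("""
    select ename,
           case when deptno < 25 then 'low' else 'high' end
    from emp order by deptno
""").fetchall()
print(searched)  # [('CLARK', 'low'), ('SMITH', 'low'), ('ALLEN', 'high')]
```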

Conversion functions

  • The cast function

    • When Hive compares a SMALLINT with an INT, the SMALLINT is implicitly converted to INT, but converting a wider type such as INT down to SMALLINT requires an explicit cast: select cast(100 as SMALLINT);
    • Any integer type can be implicitly converted to a wider type. TINYINT, SMALLINT, INT, BIGINT, FLOAT, and STRING can all be implicitly converted to DOUBLE; yes, you read that right, even STRING converts implicitly to DOUBLE! But remember: BOOLEAN cannot be converted to any other type!
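Explicit casting can be sketched outside Hive as well; a small demonstration with sqlite3 as a stand-in (sqlite has no SMALLINT width, so INTEGER is used, and the rules shown are sqlite's, which only mirror Hive's in these simple cases):

```python
import sqlite3

# Stand-in demo, not Hive itself: explicit CAST retypes a value.
conn = sqlite3.connect(":memory:")

# String -> number, the direction Hive also performs implicitly toward DOUBLE
as_int = conn.execute("select cast('100' as integer)").fetchone()[0]
print(as_int)    # 100

# Narrowing a fractional value truncates toward zero
narrowed = conn.execute("select cast(3.9 as integer)").fetchone()[0]
print(narrowed)  # 3
```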

order by,sort by,distribute by,cluster by

  • When data is spread across multiple reducers, order by cannot be combined with distribute by; use sort by instead. If the sort is ascending on the distribute-by column, the pair can be replaced directly by cluster by.

Row/column conversion

  • Functions

    • concat — concatenates strings of simple (scalar) types
    • concat_ws(separator, complex-type input)
    • collect_set | collect_list

      • The set variant drops duplicates while collecting; the list variant keeps them
  • Example

# data, tab-separated:
20      SMITH
30      ALLEN
30      WARD
20      JONES
30      MARTIN
30      BLAKE
10      CLARK
20      SCOTT
10      KING
30      TURNER
20      ADAMS
30      JAMES
20      FORD
10      MILLER

create table emp(deptno int, ename string)
row format delimited fields terminated by '\t';

select deptno, concat_ws("|", collect_set(ename)) from emp group by deptno;

# where the inner aggregation alone gives:
select collect_set(ename) from emp group by deptno;
+-----------------------------------------------------+--+
| ["CLARK","KING","MILLER"]                           |
| ["SMITH","JONES","SCOTT","ADAMS","FORD"]            |
| ["ALLEN","WARD","MARTIN","BLAKE","TURNER","JAMES"]  |
+-----------------------------------------------------+--+
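The same many-rows-to-one aggregation can be reproduced outside Hive; in this sketch sqlite3's group_concat plays the role of concat_ws("|", collect_list(ename)). sqlite has no collect_set, so duplicates would be kept, though this sample data contains none:

```python
import sqlite3

rows = [(20, "SMITH"), (30, "ALLEN"), (30, "WARD"), (20, "JONES"),
        (30, "MARTIN"), (30, "BLAKE"), (10, "CLARK"), (20, "SCOTT"),
        (10, "KING"), (30, "TURNER"), (20, "ADAMS"), (30, "JAMES"),
        (20, "FORD"), (10, "MILLER")]

conn = sqlite3.connect(":memory:")
conn.execute("create table emp(deptno int, ename text)")
conn.executemany("insert into emp values (?, ?)", rows)

# group_concat(col, sep) ~ concat_ws(sep, collect_list(col)); element
# order is not guaranteed by sqlite, so we sort after splitting.
grouped = {deptno: sorted(names.split("|")) for deptno, names in conn.execute(
    "select deptno, group_concat(ename, '|') from emp group by deptno")}
print(grouped[10])  # ['CLARK', 'KING', 'MILLER']
```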
  • Column to rows

    • explode(complex type) — a UDTF (table-generating function)
select deptno, explode(name) from emp2;

Restrictions:
1. When explode appears in the select list, it must be the only expression; no other columns are allowed.
2. It cannot be nested inside other expressions, but it can wrap other functions (e.g. explode(split(...))).

Solution: the lateral view.
LATERAL VIEW udtf(expression) tableAlias AS columnAlias   (behaves like a temporary table)

select deptno, name from emp2 LATERAL VIEW explode(names) tmp as name;

0: jdbc:hive2://node3:10000> select * from emp2 LATERAL VIEW explode(name) tmp as name1;
+--------------+-----------------------------------------------------+------------+--+
| emp2.deptno  |                      emp2.name                      | tmp.name1  |
+--------------+-----------------------------------------------------+------------+--+
| 10           | ["CLARK","KING","MILLER"]                           | CLARK      |
| 10           | ["CLARK","KING","MILLER"]                           | KING       |
| 10           | ["CLARK","KING","MILLER"]                           | MILLER     |
| 20           | ["SMITH","JONES","SCOTT","ADAMS","FORD"]            | SMITH      |
| 20           | ["SMITH","JONES","SCOTT","ADAMS","FORD"]            | JONES      |
| 20           | ["SMITH","JONES","SCOTT","ADAMS","FORD"]            | SCOTT      |
| 20           | ["SMITH","JONES","SCOTT","ADAMS","FORD"]            | ADAMS      |
| 20           | ["SMITH","JONES","SCOTT","ADAMS","FORD"]            | FORD       |
| 30           | ["ALLEN","WARD","MARTIN","BLAKE","TURNER","JAMES"]  | ALLEN      |
| 30           | ["ALLEN","WARD","MARTIN","BLAKE","TURNER","JAMES"]  | WARD       |
| 30           | ["ALLEN","WARD","MARTIN","BLAKE","TURNER","JAMES"]  | MARTIN     |
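The explode + LATERAL VIEW pattern has close analogues in other engines. As a hedged sketch, sqlite3's json_each table-valued function unnests a JSON array per row much like LATERAL VIEW explode(names) tmp AS name (this assumes a sqlite build with the JSON1 functions, standard since 3.38; the table layout is illustrative):

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table emp2(deptno int, names text)")  # names stored as a JSON array
conn.executemany("insert into emp2 values (?, ?)", [
    (10, json.dumps(["CLARK", "KING", "MILLER"])),
    (20, json.dumps(["SMITH", "JONES", "SCOTT", "ADAMS", "FORD"])),
])

# json_each(emp2.names) produces one row per array element, correlated
# with its parent row -- the same shape as a Hive lateral view.
exploded = conn.execute("""
    select emp2.deptno, j.value
    from emp2, json_each(emp2.names) as j
""").fetchall()
print(exploded[:3])  # [(10, 'CLARK'), (10, 'KING'), (10, 'MILLER')]
```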
  • Concatenating a complex type
select deptno, concat_ws("|", collect_set(ename)) as c1 from emp group by deptno;
+---------+---------------------------------------+--+
| deptno  |                  c1                   |
+---------+---------------------------------------+--+
| 10      | CLARK|KING|MILLER                     |
| 20      | SMITH|JONES|SCOTT|ADAMS|FORD          |
| 30      | ALLEN|WARD|MARTIN|BLAKE|TURNER|JAMES  |
+---------+---------------------------------------+--+
  • Restoring the above back to the original rows
select explode(split(c1, "\\|"))
from (select deptno, concat_ws("|", collect_list(ename)) as c1
      from emp group by deptno) as tmp;

The reflect function

- Supports calling built-in Java methods from SQL
- Format: reflect('java.lang.Math', 'abs', -10)
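reflect works by looking up a static Java method by class and name at runtime and invoking it. A hypothetical Python analogue of that idea, using importlib and getattr (the helper name and the math examples are illustrative, not part of Hive):

```python
import importlib

def reflect(module_path: str, func_name: str, *args):
    """Resolve a function by module and name at call time, then invoke it,
    mimicking Hive's reflect('java.lang.Math', 'abs', -10)."""
    module = importlib.import_module(module_path)
    return getattr(module, func_name)(*args)

print(reflect("math", "fabs", -10))   # 10.0
print(reflect("math", "pow", 2, 10))  # 1024.0
```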

Hive analytic (window) functions

Analytic functions, part 1

  • Format: analytic_function over(partition by xxx order by xxx [asc|desc])
  • Placement: after select, before from
  • Analytic functions:

    • Top-N per group

      • row_number() — ignores ties; every row gets a distinct sequence number
      • rank() — ties share a rank, and the next rank skips ahead by the number of tied rows
      • dense_rank() — ties share a rank without leaving gaps; numbering continues normally
      • ntile(n) over(order by xxx) with a filter such as where xx = 1 — splits the rows into n buckets as evenly as possible, handy for fetching the top 1/n
+--------------+----------------+--------+--+
| t2.cookieid  | t2.createtime  | t2.pv  |
+--------------+----------------+--------+--+
| cookie1      | 2018-04-10     | 1      |
| cookie1      | 2018-04-11     | 5      |
| cookie1      | 2018-04-12     | 7      |
| cookie1      | 2018-04-13     | 3      |
| cookie1      | 2018-04-14     | 2      |
| cookie1      | 2018-04-15     | 4      |
| cookie1      | 2018-04-16     | 4      |
| cookie2      | 2018-04-10     | 2      |
| cookie2      | 2018-04-11     | 3      |
| cookie2      | 2018-04-12     | 5      |
| cookie2      | 2018-04-13     | 6      |
| cookie2      | 2018-04-14     | 3      |
| cookie2      | 2018-04-15     | 9      |
+--------------+----------------+--------+--+

select cookieid, rank() over(partition by cookieid order by pv desc) as rank2 from t2;
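The difference between the three ranking functions can be seen in any engine with standard window functions; a sketch with sqlite3 as a stand-in, using a deliberately constructed tie on pv = 4:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table t2(cookieid text, createtime text, pv int)")
conn.executemany("insert into t2 values (?, ?, ?)", [
    ("cookie1", "2018-04-10", 1), ("cookie1", "2018-04-11", 5),
    ("cookie1", "2018-04-12", 7), ("cookie1", "2018-04-15", 4),
    ("cookie1", "2018-04-16", 4),   # tie on pv = 4
])

ranked = conn.execute("""
    select pv,
           row_number() over w,
           rank()       over w,
           dense_rank() over w
    from t2
    window w as (partition by cookieid order by pv desc)
    order by 2
""").fetchall()
for r in ranked:
    print(r)
# (7, 1, 1, 1)
# (5, 2, 2, 2)
# (4, 3, 3, 3)
# (4, 4, 3, 3)   <- rank repeats then skips; dense_rank repeats without a gap
# (1, 5, 5, 4)
```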

Hive analytic functions, part 2

-- DDL:
create table itcast_t1(
  cookieid string,
  createtime string,   -- day
  pv int
) row format delimited fields terminated by ',';

-- load the data:
load data local inpath '/root/hivedata/itcast_t1.dat' into table itcast_t1;

cookie1,2018-04-10,1
cookie1,2018-04-11,5
cookie1,2018-04-12,7
cookie1,2018-04-13,3
cookie1,2018-04-14,2
cookie1,2018-04-15,4
cookie1,2018-04-16,4
  • sum,avg,max,count
  • sum — running (cascading) aggregation problems

    • Use cases: the current row with all preceding rows; the current row with the previous three rows; the previous three rows with the next row; the current row with all following rows

      • sum(pv) over(partition by cookieid order by createtime rows between unbounded preceding and current row) as pv2
      • sum(pv) over(partition by cookieid order by createtime rows between 3 preceding and current row) as pv3
      • sum(pv) over(partition by cookieid order by createtime rows between 3 preceding and 1 following) as pv4
      • sum(pv) over(partition by cookieid order by createtime rows between current row and unbounded following) as pv6
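These frame clauses behave the same in any standard SQL engine; a runnable sketch of the first two cases with sqlite3 as a stand-in, on a subset of the sample data:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table t1(cookieid text, createtime text, pv int)")
conn.executemany("insert into t1 values (?, ?, ?)", [
    ("cookie1", "2018-04-10", 1), ("cookie1", "2018-04-11", 5),
    ("cookie1", "2018-04-12", 7), ("cookie1", "2018-04-13", 3),
    ("cookie1", "2018-04-14", 2),
])

# pv2: running total; pv3: current row plus the previous three rows.
framed = conn.execute("""
    select createtime, pv,
           sum(pv) over (partition by cookieid order by createtime
                         rows between unbounded preceding and current row) as pv2,
           sum(pv) over (partition by cookieid order by createtime
                         rows between 3 preceding and current row) as pv3
    from t1 order by createtime
""").fetchall()
for r in framed:
    print(r)
# ('2018-04-10', 1, 1, 1)
# ('2018-04-11', 5, 6, 6)
# ('2018-04-12', 7, 13, 13)
# ('2018-04-13', 3, 16, 16)
# ('2018-04-14', 2, 18, 17)   <- pv3 window slid past the first row
```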

Analytic functions, part 3

  • LAG,LEAD,FIRST_VALUE,LAST_VALUE

    • Use case: comparing a row with the rows just before or after it
+--------------+----------------------+---------+--+
| t4.cookieid  |    t4.createtime     | t4.url  |
+--------------+----------------------+---------+--+
| cookie1      | 2018-04-10 10:00:02  | url2    |
| cookie1      | 2018-04-10 10:00:00  | url1    |
| cookie1      | 2018-04-10 10:03:04  | 1url3   |
| cookie1      | 2018-04-10 10:50:05  | url6    |
| cookie1      | 2018-04-10 11:00:00  | url7    |
| cookie1      | 2018-04-10 10:10:00  | url4    |
| cookie1      | 2018-04-10 10:50:01  | url5    |
| cookie2      | 2018-04-10 10:00:02  | url22   |
| cookie2      | 2018-04-10 10:00:00  | url11   |
| cookie2      | 2018-04-10 10:03:04  | 1url33  |

-- For lag/lead the second argument is the offset (how many rows back or
-- forward to look); the third is the default used when the result is NULL.
SELECT cookieid,
  createtime,
  url,
  ROW_NUMBER() OVER(PARTITION BY cookieid ORDER BY createtime) AS rn,
  LAG(createtime,1,'1970-01-01 00:00:00') OVER(PARTITION BY cookieid ORDER BY createtime) AS last_1_time,
  LAG(createtime,2) OVER(PARTITION BY cookieid ORDER BY createtime) AS last_2_time
FROM t4;

+-----------+----------------------+---------+-----+----------------------+--+
| cookieid  |      createtime      |   url   | rn  |     last_1_time      |
+-----------+----------------------+---------+-----+----------------------+--+
| cookie1   | 2018-04-10 10:00:00  | url1    | 1   | 1970-01-01 00:00:00  |
| cookie1   | 2018-04-10 10:00:02  | url2    | 2   | 2018-04-10 10:00:00  |
| cookie1   | 2018-04-10 10:03:04  | 1url3   | 3   | 2018-04-10 10:00:02  |
| cookie1   | 2018-04-10 10:10:00  | url4    | 4   | 2018-04-10 10:03:04  |
| cookie1   | 2018-04-10 10:50:01  | url5    | 5   | 2018-04-10 10:10:00  |
| cookie1   | 2018-04-10 10:50:05  | url6    | 6   | 2018-04-10 10:50:01  |
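The lag/lead behaviour above, including the default value for the first row, can be reproduced with sqlite3 as a stand-in on a small slice of the same data:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table t4(cookieid text, createtime text, url text)")
conn.executemany("insert into t4 values (?, ?, ?)", [
    ("cookie1", "2018-04-10 10:00:00", "url1"),
    ("cookie1", "2018-04-10 10:00:02", "url2"),
    ("cookie1", "2018-04-10 10:03:04", "1url3"),
])

# lag(expr, offset, default): previous row's value, or the default on the
# first row; lead looks forward and yields NULL past the last row.
navigated = conn.execute("""
    select url,
           lag(createtime, 1, '1970-01-01 00:00:00')
               over (partition by cookieid order by createtime) as prev_time,
           lead(url, 1) over (partition by cookieid order by createtime) as next_url
    from t4 order by createtime
""").fetchall()
for r in navigated:
    print(r)
# ('url1', '1970-01-01 00:00:00', 'url2')
# ('url2', '2018-04-10 10:00:00', '1url3')
# ('1url3', '2018-04-10 10:00:02', None)
```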