About Hive: Hive configuration and built-in functions


Hive configuration and built-in functions

Configuration

  • Global: hive-site.xml
  • Command line: ./hive -hiveconf xxx
  • Session-level (effective only for the current session): jdbc:hive2://node3:10000> set mapreduce.job.reduces=3;
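For example, a session-level override in beeline looks like this (a sketch; any property name works the same way, and the setting is lost when the session ends):

```sql
-- session-level: overrides hive-site.xml for this session only
set mapreduce.job.reduces=3;
-- print the current value to verify
set mapreduce.job.reduces;
```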

Conditional functions

  • if(condition, value-if-true, value-if-false)
  • case

    • case a when b then c [when d then e]* [else f] end

      • if a = b, return c; otherwise, if a = d, return e; if nothing matches, return f
    • case when a then b [when c then d]* [else e] end

      • if condition a holds, return b; otherwise, if condition c holds, return d; if neither holds, return e
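A sketch of both forms against the emp(deptno, ename) table created later in this post (the department names and bands are purely illustrative):

```sql
select ename,
       if(deptno = 10, 'HQ', 'branch') as loc_type,
       -- form 1: compare one expression against a list of values
       case deptno when 10 then 'ACCOUNTING'
                   when 20 then 'RESEARCH'
                   else 'SALES' end as dept_name,
       -- form 2: arbitrary boolean conditions, tested top to bottom
       case when deptno <= 10 then 'low'
            when deptno <= 20 then 'mid'
            else 'high' end as dept_band
from emp;
```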

Conversion functions

  • cast function

    • When Hive compares a SMALLINT with an INT, it implicitly converts the SMALLINT to INT, but converting the wider INT down to SMALLINT requires an explicit cast: select cast(100 as SMALLINT);
    • Any integer type can be implicitly converted to a wider type. TINYINT, SMALLINT, INT, BIGINT, FLOAT, and STRING can all be implicitly converted to DOUBLE; yes, you read that right, even STRING converts implicitly to DOUBLE! But remember: the BOOLEAN type cannot be converted to any other data type!
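A quick sketch of these rules:

```sql
-- narrowing needs an explicit cast (widening is automatic)
select cast(100 as smallint);
-- STRING is implicitly promoted to DOUBLE in arithmetic
select '3' + 1.0;
-- the same promotion, written explicitly
select cast('3.14' as double);
```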

order by, sort by, distribute by, cluster by

  • When there are multiple reducers, distribute by cannot be combined with order by; use sort by instead. If the sort is ascending on the same column, this can be shortened to cluster by.
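For example, against the emp table used below (a sketch):

```sql
-- route rows to reducers by deptno, then sort within each reducer
select * from emp distribute by deptno sort by deptno;
-- equivalent shorthand: distribute and sort ascending on the same column
select * from emp cluster by deptno;
```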

Row/column transformations

  • Functions

    • concat joins strings of simple types
    • concat_ws(separator, complex-type input)
    • collect_set | collect_list

      • set removes duplicates while collecting; list does not
  • Example task

# data
20      SMITH
30      ALLEN
30      WARD
20      JONES
30      MARTIN
30      BLAKE
10      CLARK
20      SCOTT
10      KING
30      TURNER
20      ADAMS
30      JAMES
20      FORD
10      MILLER

create table emp(
deptno int,
ename string
) row format delimited fields terminated by '\t';


select deptno,concat_ws("|",collect_set(ename)) from emp group by deptno;
# where the collect_set part alone gives:
select collect_set(ename) from emp group by deptno;
+-----------------------------------------------------+--+
| ["CLARK","KING","MILLER"]                           |
| ["SMITH","JONES","SCOTT","ADAMS","FORD"]            |
| ["ALLEN","WARD","MARTIN","BLAKE","TURNER","JAMES"]  |
+-----------------------------------------------------+--+
  • Columns to rows

    • explode(complex type), a UDTF (table-generating function)
select deptno, explode(name) from emp2;   -- invalid, see restriction 1
1. When used in the select list, it must be the only expression; no other columns are allowed
2. It may not be nested inside another function, but it may wrap other functions
Solution: the lateral view
LATERAL VIEW udtf(expression) tableAlias AS columnAlias (acts like a temporary table)
select deptno,name from emp2 LATERAL VIEW explode(name) tmp as name;
0: jdbc:hive2://node3:10000>    select *  from emp2 LATERAL VIEW explode(name) tmp as name1;
+--------------+-----------------------------------------------------+------------+--+
| emp2.deptno  |                      emp2.name                      | tmp.name1  |
+--------------+-----------------------------------------------------+------------+--+
| 10           | ["CLARK","KING","MILLER"]                           | CLARK      |
| 10           | ["CLARK","KING","MILLER"]                           | KING       |
| 10           | ["CLARK","KING","MILLER"]                           | MILLER     |
| 20           | ["SMITH","JONES","SCOTT","ADAMS","FORD"]            | SMITH      |
| 20           | ["SMITH","JONES","SCOTT","ADAMS","FORD"]            | JONES      |
| 20           | ["SMITH","JONES","SCOTT","ADAMS","FORD"]            | SCOTT      |
| 20           | ["SMITH","JONES","SCOTT","ADAMS","FORD"]            | ADAMS      |
| 20           | ["SMITH","JONES","SCOTT","ADAMS","FORD"]            | FORD       |
| 30           | ["ALLEN","WARD","MARTIN","BLAKE","TURNER","JAMES"]  | ALLEN      |
| 30           | ["ALLEN","WARD","MARTIN","BLAKE","TURNER","JAMES"]  | WARD       |
| 30           | ["ALLEN","WARD","MARTIN","BLAKE","TURNER","JAMES"]  | MARTIN     |
| 30           | ["ALLEN","WARD","MARTIN","BLAKE","TURNER","JAMES"]  | BLAKE      |
| 30           | ["ALLEN","WARD","MARTIN","BLAKE","TURNER","JAMES"]  | TURNER     |
| 30           | ["ALLEN","WARD","MARTIN","BLAKE","TURNER","JAMES"]  | JAMES      |
+--------------+-----------------------------------------------------+------------+--+
  • Concatenating into a complex type
select deptno,concat_ws("|",collect_set(ename)) as c1 from emp group by deptno;
+---------+---------------------------------------+--+
| deptno  |                  c1                   |
+---------+---------------------------------------+--+
| 10      | CLARK|KING|MILLER                     |
| 20      | SMITH|JONES|SCOTT|ADAMS|FORD          |
| 30      | ALLEN|WARD|MARTIN|BLAKE|TURNER|JAMES  |
+---------+---------------------------------------+--+
  • Restoring the above back to the original data
select explode(split(c1,"\\|"))
from
(select deptno,concat_ws("|",collect_list(ename)) as c1 from emp group by deptno) as tmp;
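Note that this query recovers only the names. To keep each name paired with its deptno, the lateral view from the previous section can be reused (a sketch):

```sql
select deptno, name
from (select deptno, concat_ws("|", collect_list(ename)) as c1
      from emp group by deptno) tmp
lateral view explode(split(c1, "\\|")) t as name;
```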

reflect function

- Allows calling built-in Java methods from SQL
- Format: reflect('java.lang.Math','abs',-10)
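For example (reflect returns its result as a string; any public static method reachable from the classpath works the same way):

```sql
-- call java.lang.Math.abs via reflection
select reflect('java.lang.Math', 'abs', -10);   -- 10
-- static methods with several arguments work too
select reflect('java.lang.Math', 'max', 3, 7);  -- 7
```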

Hive analytic functions

Analytic functions, part one

  • Format: analytic_function over(partition by xxx order by xxx [asc|desc])
  • Placement: after select, before from
  • Analytic functions:

    • Top-N within a group

      • row_number(): ignores ties; every row gets a distinct sequential rank
      • rank(): respects ties; tied rows share a rank and the next rank skips ahead by n
      • dense_rank(): respects ties without leaving gaps; ranking continues normally afterwards
      • ntile(n) over(order by xxx) ... where xx = 1: splits the rows into n buckets of as equal size as possible, so the top 1/n can be selected
+--------------+----------------+--------+--+
| t2.cookieid  | t2.createtime  | t2.pv  |
+--------------+----------------+--------+--+
| cookie1      | 2018-04-10     | 1      |
| cookie1      | 2018-04-11     | 5      |
| cookie1      | 2018-04-12     | 7      |
| cookie1      | 2018-04-13     | 3      |
| cookie1      | 2018-04-14     | 2      |
| cookie1      | 2018-04-15     | 4      |
| cookie1      | 2018-04-16     | 4      |
| cookie2      | 2018-04-10     | 2      |
| cookie2      | 2018-04-11     | 3      |
| cookie2      | 2018-04-12     | 5      |
| cookie2      | 2018-04-13     | 6      |
| cookie2      | 2018-04-14     | 3      |
| cookie2      | 2018-04-15     | 9      |
select cookieid,rank() over(partition by cookieid order by pv desc) as rank2 from t2;
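All four ranking functions can be compared side by side on the same data (a sketch):

```sql
select cookieid, createtime, pv,
       row_number() over (partition by cookieid order by pv desc) as rn,
       rank()       over (partition by cookieid order by pv desc) as rk,
       dense_rank() over (partition by cookieid order by pv desc) as drk,
       ntile(3)     over (partition by cookieid order by pv desc) as nt
from t2;
```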

Analytic functions, part two

-- table DDL:
create table itcast_t1(
cookieid string,
createtime string,   --day 
pv int
) row format delimited 
fields terminated by ',';

-- load data: load data local inpath '/root/hivedata/itcast_t1.dat' into table itcast_t1;

cookie1,2018-04-10,1
cookie1,2018-04-11,5
cookie1,2018-04-12,7
cookie1,2018-04-13,3
cookie1,2018-04-14,2
cookie1,2018-04-15,4
cookie1,2018-04-16,4
  • sum, avg, max, count
  • sum: cumulative aggregation problems

    • Use cases: the current row with all preceding rows; the current row with the previous three rows; the previous three rows with the next row; the current row with all following rows

      • sum(pv) over(partition by cookieid order by createtime rows between unbounded preceding and current row) as pv2
      • sum(pv) over(partition by cookieid order by createtime rows between 3 preceding and current row) as pv3
      • sum(pv) over(partition by cookieid order by createtime rows between 3 preceding and 1 following) as pv4
      • sum(pv) over(partition by cookieid order by createtime rows between current row and unbounded following) as pv6
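The window frames above can be put together into one runnable query over itcast_t1 (a sketch):

```sql
select cookieid, createtime, pv,
       -- running total: all preceding rows plus the current row
       sum(pv) over (partition by cookieid order by createtime
                     rows between unbounded preceding and current row) as pv2,
       -- sliding window: previous 3 rows plus the current row
       sum(pv) over (partition by cookieid order by createtime
                     rows between 3 preceding and current row)         as pv3,
       -- previous 3 rows through the next row
       sum(pv) over (partition by cookieid order by createtime
                     rows between 3 preceding and 1 following)         as pv4,
       -- current row plus everything after it
       sum(pv) over (partition by cookieid order by createtime
                     rows between current row and unbounded following) as pv6
from itcast_t1;
```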

Analytic functions, part three

  • LAG, LEAD, FIRST_VALUE, LAST_VALUE

    • Use case: comparing a row with the rows before or after it
+--------------+----------------------+---------+--+
| t4.cookieid  |    t4.createtime     | t4.url  |
+--------------+----------------------+---------+--+
| cookie1      | 2018-04-10 10:00:02  | url2    |
| cookie1      | 2018-04-10 10:00:00  | url1    |
| cookie1      | 2018-04-10 10:03:04  | 1url3   |
| cookie1      | 2018-04-10 10:50:05  | url6    |
| cookie1      | 2018-04-10 11:00:00  | url7    |
| cookie1      | 2018-04-10 10:10:00  | url4    |
| cookie1      | 2018-04-10 10:50:01  | url5    |
| cookie2      | 2018-04-10 10:00:02  | url22   |
| cookie2      | 2018-04-10 10:00:00  | url11   |
| cookie2      | 2018-04-10 10:03:04  | 1url33  |


# lag / lead: the second parameter is how many rows to look back / ahead; the third is the default value used when that row does not exist
SELECT cookieid,
  createtime,
  url,
  ROW_NUMBER() OVER(PARTITION BY cookieid ORDER BY createtime) AS rn,
  LAG(createtime,1,'1970-01-01 00:00:00') OVER(PARTITION BY cookieid ORDER BY createtime) AS last_1_time,
  LAG(createtime,2) OVER(PARTITION BY cookieid ORDER BY createtime) AS last_2_time 
  FROM t4;


+-----------+----------------------+---------+-----+----------------------+----------------------+--+
| cookieid  |      createtime      |   url   | rn  |     last_1_time      |     last_2_time      |
+-----------+----------------------+---------+-----+----------------------+----------------------+--+
| cookie1   | 2018-04-10 10:00:00  | url1    | 1   | 1970-01-01 00:00:00  | NULL                 |
| cookie1   | 2018-04-10 10:00:02  | url2    | 2   | 2018-04-10 10:00:00  | NULL                 |
| cookie1   | 2018-04-10 10:03:04  | 1url3   | 3   | 2018-04-10 10:00:02  | 2018-04-10 10:00:00  |
| cookie1   | 2018-04-10 10:10:00  | url4    | 4   | 2018-04-10 10:03:04  | 2018-04-10 10:00:02  |
| cookie1   | 2018-04-10 10:50:01  | url5    | 5   | 2018-04-10 10:10:00  | 2018-04-10 10:03:04  |
| cookie1   | 2018-04-10 10:50:05  | url6    | 6   | 2018-04-10 10:50:01  | 2018-04-10 10:10:00  |
| cookie1   | 2018-04-10 11:00:00  | url7    | 7   | 2018-04-10 10:50:05  | 2018-04-10 10:50:01  |
+-----------+----------------------+---------+-----+----------------------+----------------------+--+
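The section also names LEAD, FIRST_VALUE, and LAST_VALUE; they follow the same pattern (a sketch; note that last_value needs an explicit frame to see the whole partition, otherwise the frame ends at the current row):

```sql
select cookieid, createtime, url,
       -- lead looks forward instead of backward
       lead(createtime, 1, '9999-12-31 00:00:00')
           over (partition by cookieid order by createtime) as next_1_time,
       -- first url in the partition's sort order
       first_value(url)
           over (partition by cookieid order by createtime) as first_url,
       -- last url overall, thanks to the explicit frame
       last_value(url)
           over (partition by cookieid order by createtime
                 rows between unbounded preceding and unbounded following) as last_url
from t4;
```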
