About Hive: Hive configuration and built-in functions

Hive configuration and built-in functions

Configuration

Global: hive-site.xml
Command line: ./hive -hiveconf xxx
Session setting (takes effect only for the current session): jdbc:hive2://node3:10000> set mapreduce.job.reduces=3;
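The three scopes above go from most to least persistent. As an illustration, a global setting lives in hive-site.xml as a standard Hadoop-style property; the snippet below reuses mapreduce.job.reduces from the session example, and the value 3 is only an example:

```xml
<!-- hive-site.xml fragment (example value, applies to every session) -->
<property>
  <name>mapreduce.job.reduces</name>
  <value>3</value>
</property>
```

A `set` issued in a Beeline session overrides this value for that session only; `-hiveconf` overrides it for a single client process.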

Conditional functions

  • if(condition, value when true, value when false)
  • case

    • case a when b then c [when d then e]* [else f] end

      • If a = b, return c; otherwise, if a = d, return e; if nothing matches, return f
    • case when a then b [when c then d]* [else e] end

      • If condition a holds, return b; if condition c holds, return d; if none holds, return e
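The if and both case forms above can be sketched in Python (hypothetical helper names; a rough illustration of the semantics, not Hive code):

```python
def hive_if(cond, if_true, if_false):
    """Mimics Hive: if(condition, value when true, value when false)."""
    return if_true if cond else if_false

def simple_case(a, pairs, default=None):
    """Mimics: CASE a WHEN b THEN c [WHEN d THEN e]* [ELSE f] END."""
    for when, then in pairs:
        if a == when:
            return then
    return default

def searched_case(pairs, default=None):
    """Mimics: CASE WHEN cond THEN val [WHEN cond2 THEN val2]* [ELSE e] END."""
    for cond, val in pairs:
        if cond:
            return val
    return default

print(hive_if(3 > 2, "yes", "no"))                           # yes
print(simple_case(30, [(10, "ACCT"), (30, "SALES")], "?"))   # SALES
print(searched_case([(False, "a"), (True, "b")], "c"))       # b
```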

Type conversion functions

  • cast

    • When Hive compares a SMALLINT with an INT, it implicitly widens the SMALLINT to INT, but a larger type such as INT cannot be narrowed to SMALLINT without an explicit cast: select cast(100 as SMALLINT);
    • Any integer type can be implicitly converted to a wider type. TINYINT, SMALLINT, INT, BIGINT, FLOAT, and STRING can all be implicitly converted to DOUBLE; yes, you read that right, even STRING converts implicitly to DOUBLE! But remember: BOOLEAN cannot be converted to any other type.

order by, sort by, distribute by, cluster by

  • When there are multiple reducers, distribute by cannot be combined with order by; use sort by instead. If the distribute and sort columns are the same and the order is ascending, the pair can be replaced directly with cluster by.
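A rough Python sketch of why sort by yields only per-reducer order: distribute by hashes rows to reducers, and each reducer then sorts only its own slice (reducer count and sample rows are assumptions for illustration):

```python
from collections import defaultdict

# Hypothetical (deptno, ename) rows, mirroring the emp table used below
rows = [(20, "SMITH"), (30, "ALLEN"), (10, "CLARK"), (20, "JONES"), (30, "BLAKE")]
num_reducers = 3

# DISTRIBUTE BY deptno: route each row to a reducer by hashing the key
partitions = defaultdict(list)
for deptno, ename in rows:
    partitions[hash(deptno) % num_reducers].append((deptno, ename))

# SORT BY ename: each reducer sorts only the rows it received,
# so the result is ordered within each partition, not globally
for part in partitions.values():
    part.sort(key=lambda r: r[1])
```

cluster by x is shorthand for distribute by x sort by x, and supports ascending order only.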

Row/column conversion

  • Functions

    • concat: concatenates values of simple types into a string
    • concat_ws(separator, complex-type input)
    • collect_set | collect_list

      • collect_set removes duplicates while collecting; collect_list keeps them
  • Example

# data
20      SMITH
30      ALLEN
30      WARD
20      JONES
30      MARTIN
30      BLAKE
10      CLARK
20      SCOTT
10      KING
30      TURNER
20      ADAMS
30      JAMES
20      FORD
10      MILLER

create table emp(
deptno int,
ename string
) row format delimited fields terminated by '\t';


select deptno,concat_ws("|",collect_set(ename)) from emp group by deptno;
# where the collect_set part by itself yields:
select collect_set(ename) from emp group by deptno;
+-----------------------------------------------------+--+
| ["CLARK","KING","MILLER"]                           |
| ["SMITH","JONES","SCOTT","ADAMS","FORD"]            |
| ["ALLEN","WARD","MARTIN","BLAKE","TURNER","JAMES"]  |
+-----------------------------------------------------+--+
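A minimal Python sketch of the collect_set / concat_ws pipeline above. Note that Hive does not guarantee element order in collect_set; first-seen order is kept here for readability:

```python
def collect_set(values):
    """Like Hive's collect_set: drop duplicates while collecting.
    (Hive makes no order guarantee; first-seen order used here.)"""
    seen, out = set(), []
    for v in values:
        if v not in seen:
            seen.add(v)
            out.append(v)
    return out

def collect_list(values):
    """Like Hive's collect_list: keep every value, duplicates included."""
    return list(values)

def concat_ws(sep, values):
    """Like Hive's concat_ws: join the collected values with a separator."""
    return sep.join(values)

names = ["CLARK", "KING", "CLARK", "MILLER"]
print(concat_ws("|", collect_set(names)))   # CLARK|KING|MILLER
print(concat_ws("|", collect_list(names)))  # CLARK|KING|CLARK|MILLER
```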
  • Columns to rows

    • explode(complex type), a UDTF (table-generating function)
select deptno, explode(name) from emp2;  -- NOT valid, see the restrictions below
1. When a UDTF appears after select, it must be the only expression; no other columns are allowed
2. A UDTF cannot be nested inside another UDTF, though it can wrap ordinary functions
Solution: a lateral view
LATERAL VIEW udtf(expression) tableAlias AS columnAlias (acts like a temporary table)
select deptno, name1 from emp2 LATERAL VIEW explode(name) tmp as name1;
0: jdbc:hive2://node3:10000>    select *  from emp2 LATERAL VIEW explode(name) tmp as name1;
+--------------+-----------------------------------------------------+------------+--+
| emp2.deptno  |                      emp2.name                      | tmp.name1  |
+--------------+-----------------------------------------------------+------------+--+
| 10           | ["CLARK","KING","MILLER"]                           | CLARK      |
| 10           | ["CLARK","KING","MILLER"]                           | KING       |
| 10           | ["CLARK","KING","MILLER"]                           | MILLER     |
| 20           | ["SMITH","JONES","SCOTT","ADAMS","FORD"]            | SMITH      |
| 20           | ["SMITH","JONES","SCOTT","ADAMS","FORD"]            | JONES      |
| 20           | ["SMITH","JONES","SCOTT","ADAMS","FORD"]            | SCOTT      |
| 20           | ["SMITH","JONES","SCOTT","ADAMS","FORD"]            | ADAMS      |
| 20           | ["SMITH","JONES","SCOTT","ADAMS","FORD"]            | FORD       |
| 30           | ["ALLEN","WARD","MARTIN","BLAKE","TURNER","JAMES"]  | ALLEN      |
| 30           | ["ALLEN","WARD","MARTIN","BLAKE","TURNER","JAMES"]  | WARD       |
| 30           | ["ALLEN","WARD","MARTIN","BLAKE","TURNER","JAMES"]  | MARTIN     |
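The lateral-view output above can be modeled in Python as a flat-map: each array element becomes its own output row, with the original columns repeated alongside it (sample rows abbreviated):

```python
# Rows of emp2: (deptno, name) where name is an array column
emp2 = [
    (10, ["CLARK", "KING", "MILLER"]),
    (20, ["SMITH", "JONES"]),
]

# select * from emp2 LATERAL VIEW explode(name) tmp as name1
exploded = [(deptno, name, element)
            for deptno, name in emp2
            for element in name]

for row in exploded:
    print(row)  # e.g. (10, ['CLARK', 'KING', 'MILLER'], 'CLARK')
```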
  • Concatenating complex types (rows to a delimited string)
select deptno,concat_ws("|",collect_set(ename)) as c1 from emp group by deptno;
+---------+---------------------------------------+--+
| deptno  |                  c1                   |
+---------+---------------------------------------+--+
| 10      | CLARK|KING|MILLER                     |
| 20      | SMITH|JONES|SCOTT|ADAMS|FORD          |
| 30      | ALLEN|WARD|MARTIN|BLAKE|TURNER|JAMES  |
+---------+---------------------------------------+--+
  • Restoring the original rows from the result above
select explode(split(c1,"\\|"))
from
(select deptno,concat_ws("|",collect_list(ename)) as c1 from emp group by deptno) as tmp
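A Python sketch of the full round trip: group and join rows into a delimited string, then split and explode back. Note that Hive's split takes a regular expression, which is why the query escapes the pipe as "\\|"; Python's str.split treats the separator literally:

```python
rows = [(10, "CLARK"), (10, "KING"), (20, "SMITH")]

# rows -> delimited string: group by deptno, then concat_ws("|", ...)
grouped = {}
for deptno, ename in rows:
    grouped.setdefault(deptno, []).append(ename)
joined = {deptno: "|".join(names) for deptno, names in grouped.items()}

# delimited string -> rows: split each string and explode the pieces
restored = [name for c1 in joined.values() for name in c1.split("|")]

print(sorted(restored))  # ['CLARK', 'KING', 'SMITH']
```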

The reflect function

- Supports calling built-in Java methods from SQL
- Format:
    - reflect('java.lang.Math','abs',-10)
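A loose Python analogue of reflect: resolve a function by name at runtime and invoke it with the given arguments (builtins.abs stands in for java.lang.Math.abs):

```python
import importlib

def reflect(module_name, func_name, *args):
    """Loose analogue of Hive's reflect(): look a function up by name
    at runtime and call it with the supplied arguments."""
    module = importlib.import_module(module_name)
    return getattr(module, func_name)(*args)

print(reflect("builtins", "abs", -10))  # 10, like reflect('java.lang.Math','abs',-10)
print(reflect("math", "sqrt", 16))      # 4.0
```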

Hive analytic functions

Analytic functions, part 1

  • Syntax: analytic_function over(partition by xxx order by xxx [asc|desc])
  • Placement: after select, before from
  • Analytic functions:

    • Per-group top-N

      • row_number(): ignores ties; every row gets a distinct, consecutive number
      • rank(): tied rows share the same rank, and the next rank skips ahead by the number of ties
      • dense_rank(): tied rows share the same rank without consuming positions; the next rank continues normally
      • ntile(n) over(order by xxx) ... where xx = 1: splits the rows into n buckets as evenly as possible; useful for selecting the top 1/n
+--------------+----------------+--------+--+
| t2.cookieid  | t2.createtime  | t2.pv  |
+--------------+----------------+--------+--+
| cookie1      | 2018-04-10     | 1      |
| cookie1      | 2018-04-11     | 5      |
| cookie1      | 2018-04-12     | 7      |
| cookie1      | 2018-04-13     | 3      |
| cookie1      | 2018-04-14     | 2      |
| cookie1      | 2018-04-15     | 4      |
| cookie1      | 2018-04-16     | 4      |
| cookie2      | 2018-04-10     | 2      |
| cookie2      | 2018-04-11     | 3      |
| cookie2      | 2018-04-12     | 5      |
| cookie2      | 2018-04-13     | 6      |
| cookie2      | 2018-04-14     | 3      |
| cookie2      | 2018-04-15     | 9      |
select cookieid,rank() over(partition by cookieid order by pv desc) as rank2 from t2;
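The difference between the three ranking functions can be checked with a small Python sketch over cookie1's pv values sorted descending (the two tied 4s are where the functions diverge):

```python
pv = [7, 5, 4, 4, 3, 2, 1]  # cookie1's pv, order by pv desc

def row_number(vals):
    """Ignores ties: 1, 2, 3, ..."""
    return list(range(1, len(vals) + 1))

def rank(vals):
    """Ties share a rank; the next rank skips ahead."""
    out, r = [], 0
    for i, v in enumerate(vals):
        if i == 0 or v != vals[i - 1]:
            r = i + 1
        out.append(r)
    return out

def dense_rank(vals):
    """Ties share a rank without consuming positions."""
    out, r = [], 0
    for i, v in enumerate(vals):
        if i == 0 or v != vals[i - 1]:
            r += 1
        out.append(r)
    return out

print(row_number(pv))  # [1, 2, 3, 4, 5, 6, 7]
print(rank(pv))        # [1, 2, 3, 3, 5, 6, 7]
print(dense_rank(pv))  # [1, 2, 3, 3, 4, 5, 6]
```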

Hive analytic functions, part 2

-- create table:
create table itcast_t1(
cookieid string,
createtime string,   --day 
pv int
) row format delimited 
fields terminated by ',';

-- load data:
load data local inpath '/root/hivedata/itcast_t1.dat' into table itcast_t1;

cookie1,2018-04-10,1
cookie1,2018-04-11,5
cookie1,2018-04-12,7
cookie1,2018-04-13,3
cookie1,2018-04-14,2
cookie1,2018-04-15,4
cookie1,2018-04-16,4
  • sum, avg, max, count
  • sum: cumulative aggregation over window frames

    • Use cases: the current row with all preceding rows; the current row with the previous three rows; the previous three rows with the next row; the current row with all following rows

      • sum(pv) over(partition by cookieid order by createtime rows between unbounded preceding and current row) as pv2
      • sum(pv) over(partition by cookieid order by createtime rows between 3 preceding and current row) as pv3
      • sum(pv) over(partition by cookieid order by createtime rows between 3 preceding and 1 following) as pv4
      • sum(pv) over(partition by cookieid order by createtime rows between current row and unbounded following) as pv6
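The rows between frames above can be emulated in Python with explicit slice bounds; pv holds cookie1's values in createtime order, and None stands for UNBOUNDED:

```python
pv = [1, 5, 7, 3, 2, 4, 4]  # cookie1's pv, ordered by createtime

def frame_sum(vals, i, preceding, following):
    """ROWS BETWEEN <preceding> PRECEDING AND <following> FOLLOWING
    for the row at index i; None means UNBOUNDED."""
    lo = 0 if preceding is None else max(0, i - preceding)
    hi = len(vals) if following is None else min(len(vals), i + following + 1)
    return sum(vals[lo:hi])

# pv2: unbounded preceding -> current row (a running total)
pv2 = [frame_sum(pv, i, None, 0) for i in range(len(pv))]
print(pv2)  # [1, 6, 13, 16, 18, 22, 26]

# pv3: 3 preceding -> current row (a sliding 4-row window)
pv3 = [frame_sum(pv, i, 3, 0) for i in range(len(pv))]
print(pv3)  # [1, 6, 13, 16, 17, 16, 13]
```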

Analytic functions, part 3

  • LAG, LEAD, FIRST_VALUE, LAST_VALUE

    • Use case: comparing the current row with nearby rows
+--------------+----------------------+---------+--+
| t4.cookieid  |    t4.createtime     | t4.url  |
+--------------+----------------------+---------+--+
| cookie1      | 2018-04-10 10:00:02  | url2    |
| cookie1      | 2018-04-10 10:00:00  | url1    |
| cookie1      | 2018-04-10 10:03:04  | 1url3   |
| cookie1      | 2018-04-10 10:50:05  | url6    |
| cookie1      | 2018-04-10 11:00:00  | url7    |
| cookie1      | 2018-04-10 10:10:00  | url4    |
| cookie1      | 2018-04-10 10:50:01  | url5    |
| cookie2      | 2018-04-10 10:00:02  | url22   |
| cookie2      | 2018-04-10 10:00:00  | url11   |
| cookie2      | 2018-04-10 10:03:04  | 1url33  |


# For lag and lead, the second argument is how many rows back (lag) or forward (lead) to look; the third is the default value used when no such row exists
SELECT cookieid,
  createtime,
  url,
  ROW_NUMBER() OVER(PARTITION BY cookieid ORDER BY createtime) AS rn,
  LAG(createtime,1,'1970-01-01 00:00:00') OVER(PARTITION BY cookieid ORDER BY createtime) AS last_1_time,
  LAG(createtime,2) OVER(PARTITION BY cookieid ORDER BY createtime) AS last_2_time 
  FROM t4;


+-----------+----------------------+---------+-----+----------------------+--+
| cookieid  |      createtime      |   url   | rn  |     last_1_time      |
+-----------+----------------------+---------+-----+----------------------+--+
| cookie1   | 2018-04-10 10:00:00  | url1    | 1   | 1970-01-01 00:00:00  |
| cookie1   | 2018-04-10 10:00:02  | url2    | 2   | 2018-04-10 10:00:00  |
| cookie1   | 2018-04-10 10:03:04  | 1url3   | 3   | 2018-04-10 10:00:02  |
| cookie1   | 2018-04-10 10:10:00  | url4    | 4   | 2018-04-10 10:03:04  |
| cookie1   | 2018-04-10 10:50:01  | url5    | 5   | 2018-04-10 10:10:00  |
| cookie1   | 2018-04-10 10:50:05  | url6    | 6   | 2018-04-10 10:50:01  |
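A minimal Python sketch of LAG over the createtime column (LEAD is the mirror image, looking forward instead of back):

```python
times = ["10:00:00", "10:00:02", "10:03:04", "10:10:00"]

def lag(vals, offset=1, default=None):
    """LAG(col, offset, default): the value from `offset` rows earlier
    within the partition, or `default` when no such row exists."""
    return [vals[i - offset] if i - offset >= 0 else default
            for i in range(len(vals))]

print(lag(times, 1, "1970-01-01 00:00:00"))
# ['1970-01-01 00:00:00', '10:00:00', '10:00:02', '10:03:04']
print(lag(times, 2))
# [None, None, '10:00:00', '10:00:02']
```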
