关于hive:Hive – 乐趣区

一、Hive介绍

1.1 Hive介绍

1.1.1 OLTP和OLAP

数据处理大抵能够分成两大类：联机事务处理OLTP（on-line transaction processing）、联机剖析解决OLAP（On-Line Analytical Processing）。OLTP是传统的关系型数据库的次要利用，次要是根本的、日常的事务处理，例如银行交易。OLAP是数据仓库零碎的次要利用，反对简单的剖析操作，偏重决策反对，并且提供直观易懂的查问后果。

OLTP 零碎强调数据库内存效率，强调内存各种指标的命令率，强调绑定变量，强调并发操作；
OLAP 零碎则强调数据分析，强调SQL执行市场，强调磁盘I/O，强调分区等。

1.1.2 OLTP和OLAP区别

我的项目	OLTP	OLAP
用户	操作人员，底层管理人员	决策人员，高级管理人员
性能	日常操作解决	剖析决策
DB设计	面向利用	面向主题
数据	以后的，最新的新界的，二维的分立的	历史的，汇集的，多维的，集成的对立的
存取	读写数十条记录	读写百万条记录
工作单位	简略的事务	简单查问
用户数	上千个	上百万个
DB大小	100MB-GB	100GB-TB
工夫要求	具备实时性	对工夫要求不严格
次要利用	数据库	数据仓库

1.1.3 Hive构造

Apache Hive™数据仓库软件反对应用SQL读取、写入和治理驻留在分布式存储中的大型数据集。构造能够投影到已经存储的数据上。提供了一个命令行工具和JDBC驱动程序来将用户连贯到Hive。 Hive不能独自应用，它依赖HDFS。Hive不存储数据，它的数据存储到HDFS。Hive自身也不具备剖析能力，它依赖了底层的分布式计算框架（Mapreduce/spark）。 hive自身最开始facebook公布的。最终要mapreduce来解决来自它自身社交平台的海量数据。hive纳入到apache 的基金会。以下为hive结构图：

hive有3中应用形式：

CLi命令行
JDBC/ODBC api接口：应用之前要启动Thrift server服务
Web GUI页面

1.1.4 Hive元数据管理

Hive元数据管理默认应用Derby，只是Hive内嵌的数据库，反对单会话。也能够指定其余数据库治理元数据，如Mysql

Derby：内嵌自带，单会话
MySQL：反对多会话，需要额定进行配置

1.2 Hive装置与应用

1.2.1 Hive装置

1、上传hive安装包apache-hive-1.2.1-bin.tar.gz到linux
2、上传mysql的驱动jar包mysql-connector-java-5.1.47-bin.jar到linux
3、解压hive安装包
4、批改hive-env.sh.template 名称为 hive-env.sh
5、批改hive-env.sh
        HADOOP_HOME=/opt/hadoop-2.8.1
        export HIVE_CONF_DIR=/opt/hive-1.2.1/conf
6、批改环境变量
        export HIVE_HOME=/opt/hive-1.2.1
        export PATH=$PATH:$JAVA_HOME/bin:$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$HIVE_HOME/bin

1.2.2 Hive启动

启动hive之前要先启动hdfs和yarn，如果hdfs启动了然而hive还是无奈连贯，请查看你的hdfs是否处于平安模式。如果是平安模式，请先退出hdfs的平安模式之后，再尝试连贯hive。

CLi启动：

$hive

Beeline启动：

# 启动Thrift
$ hive --service hiveserver2 & (&：作用是在后盾启动)
$ hive --service metastore &

#启动Beeline
$ beeline
$ beeline> !connect jdbc:hive2://hdp22:10000
$ Enter username for jdbc:hive2://hdp22:10000: root
$ Enter password for jdbc:hive2://hdp22:10000: 123456
或
$ beeline -u jdbc:hive2://hdp22:10000

1.2.3 MYSQL治理元数据（可选）

mysql创立hive用户

mysql>CREATE USER  'hive' IDENTIFIELD BY '123456'
mysql>GRANT ALL PRIVILEGES ON *.* TO 'hive'@'%' WITH GRANT OPTION;
mysql>flush privileges;

批改conf/hive-site.xml文件

<property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:mysql://localhost/hive?createDatabaseIfNotExist=true</value>
</property>
<property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>com.mysql.jdbc.Driver</value>
</property>
<property>
    <name>javax.jdo.option.ConnectionUserName</name>
    <value>hive</value>
</property>
<property>
    <name>javax.jdo.option.ConnectionPassword</name>
    <value>password</value>
</property>

将驱动包复制到lib/目录

cp mysql-connector-java-5.1.47-bin.jar  hive-1.2.1/lib/

1.2.4 测试

1、查看数据库： show databases;
2、创立数据库：
create database hive;
use hive;

3、创立表
create table t_user3 (
id int,
name string,
sex string,
money double
)
row format delimited
fields terminated by '\t'
lines terminated by '\n'
stored as textfile;
解释
row format delimited        行格式化
fieids terminated by '\t'     属性之间用什么宰割
lines terminated by '\n'      行之间用什么宰割
stored as textfile            存成什么数据
tip:
当表和库建好之后，会在hdfs的对应目录下生成一系列的目录：
/user/hive/warehouse/hive.db/t_user

4、查看表
show tables;

5、上传数据
load data [local] inpath '/home/user.txt' into table t_user;
tip: 导入数据到指定的表中
local ： 示意从本地文件系统（linux）中导入数据，如果不加local示意从hdfs文件系统中导入数据。

6、查看表
hive> select * from t_user;
OK
1       lixi    man     1000.0
2       rock    woman   1001.0

7、退出hive
quit；

1.3 DDL

1.3.1 数据库

（1）增
CREATE DATABASE IF NOT EXISTS database_name

（2）删
DROP  DATABASE [IF EXISTS] database_name 
DROP  DATABASE [IF EXISTS] database_name RESTRICT    //非严格的
DROP  DATABASE [IF EXISTS] database_name CASCADE     //级联删除

（3）改
//批改属性
ALTER DATABASE [IF EXISTS] database_name  SET DBPROPERTIES (property_name=property_value,...)   
//批改所有者
ALTER DATABASE [IF EXISTS] database_name  SET OWNER [USER|ROLE] user_or_role     
//批改地位
ALTER DATABASE [IF EXISTS] database_name  SET LOCATION hdfs_path                                

（4）应用
USE database_name
USE DEFAULT //应用默认数据库

（5）显示
SHOW DATABASES //显示所有数据库

1.3.2 创立表

语法

CREATE [TEMPORARY] [EXTERNAL] TABLE [IF NOT EXISTS] [db_name.]table_name    -- (Note: TEMPORARY available in Hive 0.14.0 and later)
  [(col_name data_type [column_constraint_specification] [COMMENT col_comment], ... [constraint_specification])]
  [COMMENT table_comment]
  [PARTITIONED BY (col_name data_type [COMMENT col_comment], ...)]
  [CLUSTERED BY (col_name, col_name, ...) [SORTED BY (col_name [ASC|DESC], ...)] INTO num_buckets BUCKETS]
  [SKEWED BY (col_name, col_name, ...)                  -- (Note: Available in Hive 0.10.0 and later)]
     ON ((col_value, col_value, ...), (col_value, col_value, ...), ...)
     [STORED AS DIRECTORIES]
  [
   [ROW FORMAT row_format] 
   [STORED AS file_format]
     | STORED BY 'storage.handler.class.name' [WITH SERDEPROPERTIES (...)]  -- (Note: Available in Hive 0.6.0 and later)
  ]
  [LOCATION hdfs_path]
  [TBLPROPERTIES (property_name=property_value, ...)]   -- (Note: Available in Hive 0.6.0 and later)
  [AS select_statement];   -- (Note: Available in Hive 0.5.0 and later; not supported for external tables)
 
CREATE [TEMPORARY] [EXTERNAL] TABLE [IF NOT EXISTS] [db_name.]table_name
  LIKE existing_table_or_view_name
  [LOCATION hdfs_path];
 
data_type
  : primitive_type
  | array_type
  | map_type
  | struct_type
  | union_type  -- (Note: Available in Hive 0.7.0 and later)
 
primitive_type
  : TINYINT
  | SMALLINT
  | INT
  | BIGINT
  | BOOLEAN
  | FLOAT
  | DOUBLE
  | DOUBLE PRECISION -- (Note: Available in Hive 2.2.0 and later)
  | STRING
  | BINARY      -- (Note: Available in Hive 0.8.0 and later)
  | TIMESTAMP   -- (Note: Available in Hive 0.8.0 and later)
  | DECIMAL     -- (Note: Available in Hive 0.11.0 and later)
  | DECIMAL(precision, scale)  -- (Note: Available in Hive 0.13.0 and later)
  | DATE        -- (Note: Available in Hive 0.12.0 and later)
  | VARCHAR     -- (Note: Available in Hive 0.12.0 and later)
  | CHAR        -- (Note: Available in Hive 0.13.0 and later)
 
array_type
  : ARRAY < data_type >
 
map_type
  : MAP < primitive_type, data_type >
 
struct_type
  : STRUCT < col_name : data_type [COMMENT col_comment], ...>
 
union_type
   : UNIONTYPE < data_type, data_type, ... >  -- (Note: Available in Hive 0.7.0 and later)
 
row_format
  : DELIMITED [FIELDS TERMINATED BY char [ESCAPED BY char]] [COLLECTION ITEMS TERMINATED BY char]
        [MAP KEYS TERMINATED BY char] [LINES TERMINATED BY char]
        [NULL DEFINED AS char]   -- (Note: Available in Hive 0.13 and later)
  | SERDE serde_name [WITH SERDEPROPERTIES (property_name=property_value, property_name=property_value, ...)]
 
file_format:
  : SEQUENCEFILE
  | TEXTFILE    -- (Default, depending on hive.default.fileformat configuration)
  | RCFILE      -- (Note: Available in Hive 0.6.0 and later)
  | ORC         -- (Note: Available in Hive 0.11.0 and later)
  | PARQUET     -- (Note: Available in Hive 0.13.0 and later)
  | AVRO        -- (Note: Available in Hive 0.14.0 and later)
  | JSONFILE    -- (Note: Available in Hive 4.0.0 and later)
  | INPUTFORMAT input_format_classname OUTPUTFORMAT output_format_classname
 
column_constraint_specification:
  : [ PRIMARY KEY|UNIQUE|NOT NULL|DEFAULT [default_value]|CHECK  [check_expression] ENABLE|DISABLE NOVALIDATE RELY/NORELY ]
 
default_value:
  : [ LITERAL|CURRENT_USER()|CURRENT_DATE()|CURRENT_TIMESTAMP()|NULL ] 
 
constraint_specification:
  : [, PRIMARY KEY (col_name, ...) DISABLE NOVALIDATE RELY/NORELY ]
    [, PRIMARY KEY (col_name, ...) DISABLE NOVALIDATE RELY/NORELY ]
    [, CONSTRAINT constraint_name FOREIGN KEY (col_name, ...) REFERENCES table_name(col_name, ...) DISABLE NOVALIDATE 
    [, CONSTRAINT constraint_name UNIQUE (col_name, ...) DISABLE NOVALIDATE RELY/NORELY ]
    [, CONSTRAINT constraint_name CHECK [check_expression] ENABLE|DISABLE NOVALIDATE RELY/NORELY ]

例子：

（1）创立表
========================================================
CREATE TABLE IF NOT EXISTS IPTV_JSCN_NJ.std_category (
colum1 type，
colom2 type，
...
)
ROW FORMAT DELIMITED        -- 行格式化
FIELDS TERMINATED BY '\001'    -- 列宰割符号 -- 默认为 '\001' （ctrl + A 或者 ^A）
LINES TERMINATED BY '\n'      -- 行宰割符号
STORED AS TEXTFILE;            -- 存成什么数据
LOCATION "/user/hive/warehouse/hive.db/t_user"    -- 表数据存储目录



（2）创立内部表
=======================================================
应用【EXTERNAL】作为标识
外部表和内部表的区别
1. 删除外部表：删除元数据，数据删除
2. 删除内部表：删除元数据，数据不删除
CREATE 【EXTERNAL】 TABLE IF NOT EXISTS IPTV_JSCN_NJ.std_category (
colum1 type，
colom2 type，
...
)
ROW FORMAT DELIMITED        -- 行格式化
FIELDS TERMINATED BY '\001'    -- 列宰割符号 -- 默认为 '\001' （ctrl + A 或者 ^A）
LINES TERMINATED BY '\n'      -- 行宰割符号
STORED AS TEXTFILE;            -- 存成什么数据
LOCATION "/user/hive/warehouse/hive.db/t_user"    -- 表数据存储目录


（3）创立分区表
=======================================================
CREATE 【EXTERNAL】 TABLE IF NOT EXISTS IPTV_JSCN_NJ.std_category (
colum1 type，
colom2 type，
...
)
PARTITIONED BY (dt string, siteKey string) -- 应用dt和siteKey作为分区字段
ROW FORMAT DELIMITED        -- 行格式化
FIELDS TERMINATED BY '\001'    -- 列宰割符号 -- 默认为 '\001' （ctrl + A 或者 ^A）
LINES TERMINATED BY '\n'      -- 行宰割符号
STORED AS TEXTFILE;            -- 存成什么数据
LOCATION "/user/hive/warehouse/hive.db/t_user"    -- 表数据存储目录


（4）创立分桶表
必须设置reduceTask个数（决定了分桶个数）：set mapreduce.job.reduces=3;
=======================================================
CREATE 【EXTERNAL】 TABLE IF NOT EXISTS IPTV_JSCN_NJ.std_category (
colum1 type，
colom2 type，
...
)
PARTITIONED BY (dt string, siteKey string) -- 应用dt和siteKey作为分区字段
CLUSTERED BY (colum1) SORTED BY (colum1 ASC) INTO 3 BUCKETS  -- 以colum1作为分桶字段，应用colum1排序，分3个桶
ROW FORMAT DELIMITED        -- 行格式化
FIELDS TERMINATED BY '\001'    -- 列宰割符号 -- 默认为 '\001' （ctrl + A 或者 ^A）
LINES TERMINATED BY '\n'      -- 行宰割符号
STORED AS TEXTFILE;            -- 存成什么数据
LOCATION "/user/hive/warehouse/hive.db/t_user"    -- 表数据存储目录

1.2.2 删除表

（1）删除表
=========================
DROP TABLE table_name

（2）删除表（重建表，表数据失落）
=========================
TRUNCATE TABLE table_name

1.2.3 批改表

严格读时模式：在hive进行数据导入的时候并不会校验你的数据格式是否合乎hive表的标准，只有在你进行查问的才会去校验你的文件中的数据的格局，如果格局不满足就会应用NULL来代替你的字段。
严格写时模式：在进行数据插入的时候会严格的校验你的语法格局。最典型的例子，就是有主外键关联的表，你在插入一个有外键的值的时候，外键的值必须要从另一个关联表中的主键中抉择其中一个值。

(1)批改表构造
=========================
Alter Table
    Rename Table【批改表名称】    
    Alter Table Properties【批改表属性】
    Change Column Name/Type/Position【批改列名称，数据类型，地位】
    Add/Replace Columns【增加列，替换所有列】
    
1.批改表名称：ALTER TABLE table_name RENAME table_name2
2.批改表属性：ALTER TABLE table_name SET TBLPROPERTIES ('EXTERNAL' = 'TRUE');  -- 批改外部表为内部表
3.批改列名称：ALTER TABLE table_name CHANGE COLUMN name1 name2 string；
4.批改列地位：ALTER TABLE table_name CHANGE COLUMN name1 name2 string FIRST；  -- 移到第一列
5.批改列地位：ALTER TABLE table_name CHANGE COLUMN name1 name2 string AFTER id -- 移到某列之后
6.增加列：ALTER TABLE table_name ADD COLUMNS (id int, name string)
7.整体替换列：ALTER TABLE table_name REPLACE COLUMNS(id int,name string)
    
    
（2）批改表分区
=========================
Alter Partition
    Add Partitions【增加分区】
    Dynamic Partitions【动静分区】
    Rename Partition【重命名分区】
    Drop Partitions【删除分区】

1.增加分区：  ALTER TABLE table_name ADD PARTITION (location = 'value1')
2.重命名分区：ALTER TABLE table_name PARTITION location_old RENAME TO PARTITION location_new
3.删除分区：  ALTER TABLE table_name DROP [IF EXISTS] PARTITION(location = value1)

1.2.4 查问表

1.查看表构造：DESC tablename
2.查看表具体构造：DESC EXTENDED tablename           -- 含扩大信息
3.查看表格式化构造：DESC FORMATTED tablename     -- 含扩大信息
4.查看DDL语句：SHOW CREATE TABLE table_name


例：
hive> desc extended t_1;
OK
id                      int
name                    string
sex                     string
money                   double

Detailed Table Information      Table(tableName:t_1, dbName:hive, owner:root, createTime:1557309921, lastAccessTime:0, retention:0, sd:StorageDescriptor(cols:[FieldSchema(name:id, type:int, comment:null), FieldSchema(name:name, type:string, comment:null), FieldSchema(name:sex, type:string, comment:null), FieldSchema(name:money, type:double, comment:null)], location:hdfs://y51:9000/user/hive/warehouse/hive.db/t_1, inputFormat:org.apache.hadoop.mapred.TextInputFormat, outputFormat:org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat, compressed:false, numBuckets:-1, serdeInfo:SerDeInfo(name:null, serializationLib:org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, parameters:{line.delim=
, field.delim=  , serialization.format= }), bucketCols:[], sortCols:[], parameters:{}, skewedInfo:SkewedInfo(skewedColNames:[], skewedColValues:[], skewedColValueLocationMaps:{}), storedAsSubDirectories:false), partitionKeys:[], parameters:{commment=this is jianghai0702 table, last_modified_time=1557378293, totalSize=76, numRows=-1, rawDataSize=-1, COLUMN_STATS_ACCURATE=false, numFiles=2, transient_lastDdlTime=1557378293, last_modified_by=root}, viewOriginalText:null, viewExpandedText:null, tableType:MANAGED_TABLE)

1.4、DML

1.4.1 加载数据

文件->Table

hive> LOAD DATA [LOCAL] INPATH 'filepath' [OVERWRITE] INTO TABLE tablename [PARTITION (partcol1=val1, partcol2=val2 ...)]

1.4.2 插入数据

将数据插入在某表，会执行mapreduce

2.1 Into 是追加数据
hive> INSERT OVERWRITE TABLE tablename1 [PARTITION (partcol1=val1, partcol2=val2 ...) [IF NOT EXISTS]] 
select_statement1 FROM from_statement;

2.2 Overwrite 是将数据进行笼罩
hive> INSERT INTO TABLE tablename1 [PARTITION (partcol1=val1, partcol2=val2 ...)] 
select_statement1 FROM from_statement;

1.4.3 插入目录

将数据写到某目录下

hive> INSERT OVERWRITE [LOCAL] DIRECTORY directory1
  [ROW FORMAT row_format] [STORED AS file_format] 
  SELECT ... FROM ...
 
e.g.
hive> insert overwrite directory '/write'
row format delimited
select * from part_buk;

1.4.4 插入表（values形式）

（插入数据，values传值形式）

hive> INSERT INTO TABLE tablename [PARTITION (partcol1[=val1], partcol2[=val2] ...)] VALUES values_row [, values_row ...]

e.g.
hive> insert into t_stu values(4,'qphone','woman',18,'java');

1.4.5 更新数据（X）

(反对hive中反对ACID的事务的文件格式;如：Orc;应用thrift服务的形式能力应用)

update操作不会批改原有的数据，而是会创立一个新的版本，在新的版本中将要批改的数据做一个标记

UPDATE tablename SET column = value [, column = value ...] [WHERE expression]

e.g.
1. 批改hive-site.xml
<property>
    <name>hive.support.concurrency</name>
    <value>true</value>
  </property>
    <property>
    <name>hive.enforce.bucketing</name>
    <value>true</value>
  </property>
    <property>
    <name>hive.exec.dynamic.partition.mode</name>
    <value>nonstrict</value>
  </property>
  <property>
    <name>hive.txn.manager</name>
    <value>org.apache.hadoop.hive.ql.lockmgr.DbTxnManager</value>
  </property>
    <property>
    <name>hive.compactor.initiator.on</name>
    <value>true</value>
  </property>
  <property>
    <name>hive.compactor.worker.threads</name>
    <value>1</value>
  </property>


2. 建表
create table t_stu3 (
id int,
name string,
sex string,
age int,
subject string
)
clustered by (id) into 2 buckets
row format delimited
fields terminated by ','
stored as orc
TBLPROPERTIES('transactional'='true');

insert into t_stu3
select * from t_stu;

update t_stu3 set name='curry',subject='basketball' where id=17;

1.4.6 删除数据（X）

DELETE FROM tablename [WHERE expression]

1.4.7 合并小文件

小文件预防（hive2默认开启）

//每个Map最大输出大小(这个值决定了合并后文件的数量)
set mapred.max.split.size=256000000;  
//一个节点上split的至多的大小(这个值决定了多个DataNode上的文件是否须要合并)
set mapred.min.split.size.per.node=100000000;
//一个交换机下split的至多的大小(这个值决定了多个交换机上的文件是否须要合并)  
set mapred.min.split.size.per.rack=100000000;
//执行Map前进行小文件合并
set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat; 
//设置map端输入进行合并，默认为true
set hive.merge.mapfiles = true
//设置reduce端输入进行合并，默认为false
set hive.merge.mapredfiles = true
//设置合并文件的大小
set hive.merge.size.per.task = 256*1000*1000
//当输入文件的均匀大小小于该值时，启动一个独立的MapReduce工作进行文件merge。
set hive.merge.smallfiles.avgsize=16000000

有些公司用的版本不同，低版本可能有些配置不一样，最好检查一下下面这些配置是否设置，而后依据本人的理论集群状况进行设置。

小文件的产生起因

动静分区插入的时候，保障有动态分区，不要误判导致产生大量分区，大量分区加起来，天然就有大量小文件；
如果源表是有大量小文件的，在导入数据到指标表的时候，如果只是insert into dis select * from origin的话，指标表通常也有很多小文件。如果有分区，比方dt, hour，能够应用distribute by dt, hour，保障每个小时的数据在一个reduce外面；
相似sqoop增量导入，还有hive一些表的查问增量导入，这些必定是有小文件的，须要进行一周甚至一天定时工作的小文件合并。

小文件解决

从小文件产生的途经就能够从源头上管制小文件数量，办法如下：

1.应用Sequencefile作为表存储格局，不要用textfile，在肯定水平上能够缩小小文件。

2.缩小reduce的数量(能够应用参数进行管制)。

3.少用动静分区，用时记得按distribute by分区。

对已有的

1.应用hadoop archive命令把小文件进行归档。

2.重建表，建表时缩小reduce数量。

3.通过参数进行调节，设置map/reduce端的相干参数，如下：

下面是平时开发数据工作时候，小文件的预防，但如果因为咱们的粗心，小文件问题曾经产生了，就须要解决了。通常就是insert overwrite了。

insert overwrite table test [partition(hour=...)] select * from test distribute by floor (rand()*5);

注：这个语句把test表的数据查问进去，overwrite笼罩test表，不必放心如果overwrite失败，数据没了，这外面是有事物性保障的，能够察看一下执行的时候，在test表hdfs文件目录上面有个长期文件夹。如果是分区表，加上partition，示意对该分区进行overwrite。如果是orc格局存储的表，还能够应用alter table test [partition(...)] concatenate进行小文件的合并，不过这种办法仅仅实用于orc格局存储的表。

1.4.8 导入导出：IMPORT 、EXPORT

只反对HDFS

语法：

导出语法
EXPORT TABLE tablename [PARTITION (part_column="value"[, ...])]
TO 'export_target_path' [ FOR replication('eventid') ]

导入语法
IMPORT [[EXTERNAL] TABLE new_or_original_tablename [PARTITION (part_column="value"[, ...])]]
FROM 'source_path'
[LOCATION 'import_target_path']

导入

将log.txt导入table_1
IMPORT TABLE table_1 FROM '/home/log.txt'  

将log.txt导入table_1的name1分区
IMPORT TABLE table_1 PARTITION(partition="name1") 
FROM '/home/log.txt'

导出

将table_1数据导出到00
EXPORT TABLE tabel_1 TO '/home/00' 

将table_1的name1分区的数据导出到00
EXPORT TABLE tabel_1 PARTITION(partition="name1")
TO '/home/00'

1.4.9 导入导出：应用插入数据的形式实现

反对HDFS和本地Local

导入

1 本地零碎中导入 load data local inpath '' into table
2 hdfs中导入    load data inpath '' into table
3 间接将文件上传到hdfs的表目录中
4 克隆数据：location
5 表与表之间间接进行导入
    insert into destine
    select * from source;
6 create as
    create table destine
    as
    select * from source;
7 批量导入
    FROM t_index
    insert into table t_insert
    SELECT * where id = 4
    insert into table hive.t_stu
    SELECT * where id = 1
    ;

导出

1. 导出到本地文件系统
2. 导出到hdfs零碎
3. 建一个表的数据导入到另一个表中

关键字：insert overwrite 

---导出到本地文件系统
insert overwrite local directory '/home/00'
select * from t_index;

---导出到本地文件系统并且指定分隔符
insert overwrite local directory '/home/01'
row format delimited fields terminated by ','
select * from t_index;

---导出hdfs
insert overwrite directory '/01'
row format delimited fields terminated by ','
select * from t_index;

1.4.10 EXPLAIN解析Hive

https://cwiki.apache.org/conf…

通过explain或者explain extended来查看执行打算。（通过sql语句翻译成mapreduce程序）
通过EXPLAIN来优化SQL语句

explain 
select * from t_buk;

explain extended
select * from t_buk;

explain select * from t_buk;

STAGE DEPENDENCIES:
  Stage-0 is a root stage

STAGE PLANS:
  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        TableScan
          alias: t_buk
          Statistics: Num rows: 12 Data size: 207 Basic stats: COMPLETE Column stats: NONE
          Select Operator
            expressions: id (type: int), name (type: string), sex (type: string), money (type: double)
            outputColumnNames: _col0, _col1, _col2, _col3
            Statistics: Num rows: 12 Data size: 207 Basic stats: COMPLETE Column stats: NONE
            ListSink

explain extended select * from t_buk;

ABSTRACT SYNTAX TREE:

TOK_QUERY
   TOK_FROM
      TOK_TABREF
         TOK_TABNAME
            t_buk
   TOK_INSERT
      TOK_DESTINATION
         TOK_DIR
            TOK_TMP_FILE
      TOK_SELECT
         TOK_SELEXPR
            TOK_ALLCOLREF


STAGE DEPENDENCIES:
  Stage-0 is a root stage

STAGE PLANS:
  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        TableScan
          alias: t_buk
          Statistics: Num rows: 12 Data size: 207 Basic stats: COMPLETE Column stats: NONE
          GatherStats: false
          Select Operator
            expressions: id (type: int), name (type: string), sex (type: string), money (type: double)
            outputColumnNames: _col0, _col1, _col2, _col3
            Statistics: Num rows: 12 Data size: 207 Basic stats: COMPLETE Column stats: NONE
            ListSink

1.5、DQL

Where语句

SELECT * FROM sales WHERE amount > 10 AND region = "US"

DISTINCT 去重

hive> SELECT col1, col2 FROM t1
    1 3
    1 3
    1 4
    2 5
hive> SELECT DISTINCT col1, col2 FROM t1
    1 3
    1 4
    2 5
hive> SELECT DISTINCT col1 FROM t1
    1
    2

PARTITION 分区

See also Partition Filter Syntax.
See also Group By.
See also Sort By / Cluster By / Distribute By / Order By.

SELECT page_views.*
FROM page_views
WHERE page_views.date >= '2008-03-01' AND page_views.date <= '2008-03-31'

LIMIT 分页

SELECT * FROM customers LIMIT 5
SELECT * FROM customers ORDER BY create_date LIMIT 2,5

分组：GROUP BY

https://cwiki.apache.org/conf…

为了按性别计算不同用户的数量：（将查到的数据插入到表中）
INSERT OVERWRITE TABLE pv_gender_sum
SELECT pv_users.gender, count (DISTINCT pv_users.userid)
FROM pv_users
GROUP BY pv_users.gender;


高级性能：多组分组插入
FROM pv_users
INSERT OVERWRITE TABLE pv_gender_sum
  SELECT pv_users.gender, count(DISTINCT pv_users.userid)
  GROUP BY pv_users.gender
INSERT OVERWRITE DIRECTORY '/user/facebook/tmp/pv_age_sum'
  SELECT pv_users.age, count(DISTINCT pv_users.userid)
  GROUP BY pv_users.age;

分桶：CLUSTER/DISRIBUTE BY

https://cwiki.apache.org/conf…

Cluster by (不能应用部分排序)
Distribute by … sort by … (sort部分排序，order by：全局排序)

语法(可用于数据导入)：
SELECT col1, col2 FROM t_stu CLUSTER BY col1;
SELECT col1, col2 FROM t_stu DISTRIBUTE BY col1    SORT BY col1 ASC, col2 DESC;

必须设置reduceTask个数（决定了分桶个数）：
set mapreduce.job.reduces=3;

排序：SORT/ORDER BY

sort by：部分排序
order by：全局排序

SELECT col1, col2 FROM t_stu ORDER BY col1 ASC, col2 DESC;
SELECT col1, col2 FROM t_stu DISTRIBUTE BY col1    SORT BY col1 ASC, col2 DESC;

连贯：INNER/LEFT/RIGHT/FULL JOIN

https://cwiki.apache.org/conf…

1.容许简单的连贯表达式
SELECT a.* FROM a JOIN b ON (a.id = b.id)
SELECT a.* FROM a LEFT OUTER JOIN b ON (a.id <> b.id)

2.能够在同一查问中退出两个以上的表
SELECT a.val, b.val, c.val 
FROM a 
JOIN b ON (a.key = b.key1) 
JOIN c ON (c.key = b.key2)

3.【key雷同视为一个作业】如果对于每个表在连接子句中应用雷同的列，Hive会将多个表上的连贯转换为单个map / reduce作业，
SELECT a.val, b.val, c.val 
FROM a 
JOIN b ON (a.key = b.key1) 
JOIN c ON (c.key = b.key1)

4.【key不同视为不同作业】转换为两个map / reduce作业，因为b中的key1列用于第一个连贯条件，而b中的key2列用于第二个连贯条件。第一个map / reduce作业与a连贯，而后在第二个map / reduce作业中将后果与c连贯。
SELECT a.val, b.val, c.val 
FROM a 
JOIN b ON (a.key = b.key1) 
JOIN c ON (c.key = b.key2)

5.【大表在最初】在连贯的每个map / reduce阶段中，序列中的最初一个表通过reducers流式传输，其余表在缓冲区中。因而，通过组织表使得最大的表呈现在序列的最初，有助于缩小reducer中用于缓冲连贯键的特定值的行所需的内存。例如in
SELECT a.val, b.val, c.val 
FROM a 
JOIN b ON (a.key = b.key1) 
JOIN c ON (c.key = b.key1)

6.【指定流式传输表】在连贯的每个map / reduce阶段，能够通过提醒指定要流式传输的表。例如in
SELECT /*+ STREAMTABLE(a) */ a.val, b.val, c.val 
FROM a 
JOIN b ON (a.key = b.key1) 
JOIN c ON (c.key = b.key1)

7.LEFT，RIGHT和FULL OUTER连贯，以便提供对无条件匹配的ON子句的更多管制。例如，这个查问：
SELECT a.id,a.name,b.id,b.name from t_a a left JOIN t_b b on a.id=b.id;
SELECT a.id,a.name,b.id,b.name from t_a a right JOIN t_b b on a.id=b.id;
SELECT a.id,a.name,b.id,b.name from t_a a full JOIN t_b b on a.id=b.id;

8.LEFT SEMI JOIN：左半连贯（只显示左表的数据）
SELECT a.key, a.value
FROM a
WHERE a.key in ( SELECT b.key FROM B );
能够写成：
SELECT a.key, a.val
FROM a 
LEFT SEMI JOIN b ON (a.key = b.key)

9.【尽量应用EXISTS】IN / EXISTS，如果应用EXISTS，子查问没有，父查问不会执行；而IN父查问仍然执行 
SELECT * 
FROM t_a a 
WHERE id EXISTS (select id from t_b b where b.id = a.id);

10.【MAPJOIN】如果两张表在连贯的列上都进行了“流式传输”，并且一个表中的桶数是另一个表中的桶数的倍数，那么能够应用MAPJOIN。
SELECT /*+ MAPJOIN(b) */ a.key, a.value
FROM a 
JOIN b ON a.key = b.key
#注：MAPJOIN不是默认行为，并由以下参数管制
set hive.optimize.bucketmapjoin = true

11.【MAPJOIN】如果两张表在连贯的列上都进行了“排序和分桶”，并且桶的数量一样，那么能够应用MAPJOIN。
SELECT /*+ MAPJOIN(b) */ a.key, a.value
FROM A a 
JOIN B b ON a.key = b.key
#注：这不是默认行为，须要设置以下参数：
set hive.input.format=org.apache.hadoop.hive.ql.io.BucketizedHiveInputFormat;
set hive.optimize.bucketmapjoin = true;
set hive.optimize.bucketmapjoin.sortedmerge = true;


12.【MAPJOIN限度】：以下状况不反对MAPJOIN
1.联结查问Union
2.横向视图Lateral View
3.Group By/Join/Sort By/Cluster By/Distribute By
4.MapJoin前面跟着Union
5.MapJoin前面跟着Join
6.MapJoin前面跟着MapJoin

优化：JOIN

https://cwiki.apache.org/conf…

联结：UNION

1.2.0之前：不会删除反复行
1.2.0之后：默认删除反复行，
ALL：不删除反复行， DISTINCT：删除反复行
每个select_statement返回的列的数量和名称必须雷同。否则，抛出架构谬误。

https://cwiki.apache.org/conf…

1.UNION例子
select_statement
UNION ALL
select_statement

2.FROM中的UNION
SELECT u.id, actions.date
FROM (
    SELECT av.uid AS uid
    FROM action_video av
    WHERE av.date = '2008-06-03'
    UNION ALL
    SELECT ac.uid AS uid
    FROM action_comment ac
    WHERE ac.date = '2008-06-03'
 ) actions JOIN users u ON (u.id = actions.uid)

3.列别名
语句1：谬误
INSERT OVERWRITE TABLE target_table
  SELECT name, id, category FROM source_table_1
  UNION ALL
  SELECT name, id, "Category159" FROM source_table_2
  
语句2：正确
INSERT OVERWRITE TABLE target_table
  SELECT name, id, category FROM source_table_1
  UNION ALL
  SELECT name, id, "Category159" as category FROM source_table_2
  
  
4.列类型转换（和列别名差不多）
SELECT name, id, cast('2001-01-01' as date) d FROM source_table_1
UNION ALL
SELECT name, id, hiredate as d FROM source_table_2

取样：TABLESAMPLE

取样须要分桶

https://cwiki.apache.org/conf…

语法：
TABLESAMPLE (BUCKET x OUT OF y [ON colname])
x：第几桶，从1开始
y：总桶数
col：列


简略抽样：取第几桶数据
select * from t_buk tablesample(bucket 1 out of 3);
select * from t_buk tablesample(bucket 2 out of 3);


随机桶抽样：查问3桶中的第1桶，因为应用了rand（），会随机加桶。
select * from t_buk tablesample(bucket 1 out of 3 on rand());
select * from t_buk tablesample(bucket 2 out of 3 on rand());


固定桶抽样：取第几桶数据（实际效果和简略抽样统一）
select * from t_buk tablesample(bucket 1 out of 3 on id);
select * from t_buk tablesample(bucket 2 out of 3 on id);


压缩桶取样：取第几桶数据（所有桶的数据会依据桶的数量从新进行调配，n%2 ）
select * from t_buk tablesample(bucket 1 out of 2 on id);
select * from t_buk tablesample(bucket 2 out of 2 on id);


拉伸桶取样：取第几桶数据（所有桶的数据会依据桶的数量从新进行调配，n%4 ）
select * from t_buk tablesample(bucket 1 out of 4 on id);
select * from t_buk tablesample(bucket 2 out of 4 on id);


百分比取样：
select * from t_buk tablesample(10 percent);    // 12*0.1 = 1.2，有余数则 +1
select * from t_buk tablesample(20 percent);    // 12*0.2 = 2.4,


字节长度取样：
ByteLengthLiteral : (Digit)+ ('b' | 'B' | 'k' | 'K' | 'm' | 'M' | 'g' | 'G')
select * from t_buk tablesample(10b);

子查问：FROM、WHERE

FROM的子查问

1.语法：
SELECT ... FROM (subquery) table_name ...
SELECT ... FROM (subquery) AS table_name ...


2.例子：
SELECT col
FROM (
  SELECT a+b AS col
  FROM t1
) t2


3.子查问蕴含UNION ALL的示例：
SELECT t3.col
FROM (
  SELECT a+b AS col
  FROM t1
  UNION ALL
  SELECT c+d AS col
  FROM t2
) t3

WHERE的子查问

1.例子：
SELECT *
FROM A
WHERE A.a IN (SELECT foo FROM B);


2.其余受反对的类型是EXISTS和NOT EXISTS子查问：
SELECT A
FROM T1
WHERE EXISTS (SELECT B FROM T2 WHERE T1.X = T2.Y)

虚构列：Virtual Columns

虚构列
Hive 0.8.0反对两个虚构列：
一个是INPUT__FILE__NAME，它是mapper工作的输出文件名。
另一个是BLOCK__OFFSET__INSIDE__FILE，这是以后的全局文件地位。
对于块压缩文件，它是以后块的文件偏移量，它是以后块的第一个字节的文件偏移量。


从Hive 0.8.0开始，增加了以下虚构列：
    ROW__OFFSET__INSIDE__BLOCK
    RAW__DATA__SIZE
    ROW__ID
    GROUPING__ID


须要留神的是，此处列出的所有虚构列都不能用于任何其余目标（即，应用具备虚构列的列创立表将失败，并显示“SemanticException谬误10328：有效的列名...”）

简略的例子
从src中抉择INPUT__FILE__NAME，键BLOCK__OFFSET__INSIDE__FILE;
按键INPUT__FILE__NAME按程序从src组中选择键，count（）;
从src中抉择*，其中BLOCK__OFFSET__INSIDE__FILE> 12000按键排序;

内置函数【0】：UDF

https://cwiki.apache.org/conf…

运算符
内置函数UDF
内置聚合函数UDAF
内置表生成函数UDTF

show functions

hive> show functions;
OK
!
!=
%
&
*
+
-
/
<
<=
<=>
<>
=
==
>
>=
^
abs
acos
add_months
and
array
array_contains
ascii
asin
assert_true
atan
avg
base64
between
bin
case
ceil
ceiling
coalesce
collect_list
collect_set
compute_stats
concat
concat_ws
context_ngrams
conv
corr
cos
count
covar_pop
covar_samp
create_union
cume_dist
current_database
current_date
current_timestamp
current_user
date_add
date_sub
datediff
day
dayofmonth
decode
default.row_num
degrees
dense_rank
div
e
elt
encode
ewah_bitmap
ewah_bitmap_and
ewah_bitmap_empty
ewah_bitmap_or
exp
explode
field
find_in_set
first_value
floor
format_number
from_unixtime
from_utc_timestamp
get_json_object
greatest
hash
hex
histogram_numeric
hour
if
in
in_file
index
initcap
inline
instr
isnotnull
isnull
java_method
json_tuple
lag
last_day
last_value
lcase
lead
least
length
like
ln
locate
log
log10
log2
lower
lpad
ltrim
map
map_keys
map_values
matchpath
max
min
minute
month
named_struct
negative
next_day
ngrams
noop
noopstreaming
noopwithmap
noopwithmapstreaming
not
ntile
nvl
or
parse_url
parse_url_tuple
percent_rank
percentile
percentile_approx
pi
pmod
posexplode
positive
pow
power
printf
radians
rand
rank
reflect
reflect2
regexp
regexp_extract
regexp_replace
repeat
reverse
rlike
round
row_number
rpad
rtrim
second
sentences
sign
sin
size
sort_array
space
split
sqrt
stack
std
stddev
stddev_pop
stddev_samp
str_to_map
struct
substr
substring
sum
tan
to_date
to_unix_timestamp
to_utc_timestamp
translate
trim
ucase
unbase64
unhex
unix_timestamp
upper
var_pop
var_samp
variance
weekofyear
when
windowingtablefunction
xpath
xpath_boolean
xpath_double
xpath_float
xpath_int
xpath_long
xpath_number
xpath_short
xpath_string
year
|
~

Function	Return Type	Argument Types	Description
unix_timestamp	timestamp	[date[, pattern]]	Returns the UNIX timestamp
upper	string	string	Returns str Returns str with all characters changed to uppercase
















upper

内置函数例子：

随机数：select rand();
随机0-100： select split(rand()*100,"\\.")[0];

四舍五入：
select round( rand()*100,0);
select round( rand()*100);   //0-100,保留一位小数
select round( rand()*100,2); //0-100,保留2位小数

select substring(rand()*100,0,2);  //截取0-2位
select substr(rand()*100,0,2);

select indexof(rang()*100,".");   //.的地位 ???
select locate(rand()*100,"."); ???

select regexp_replace("a.jpg","jpg","png");  //正则替换
select cast( 1 as double);  //类型强制转换
字符串连贯： CONCAT()， +

判断语句：
select tihuan
case
when 1=1 then "男"
when 1=2 then "女"
else "人妖"
end;

case vis.next_data_source
    when 'IPM'       then   case vis.source_type    when 'tvbar' then biz_ipm.name    when 'subject' then subj_ipm.name    when 'detail' then det_ipm.name end
    when 'VIS'       then   case vis.source_type    when 'tvbar' then biz.name        when 'subject' then subj.name        end
    when 'playproxy' then   case vis.source_type    when 'tvbar' then biz.name        when 'subject' then subj.name        end
    when 'EPG'       then   case vis.source_type    when 'tvbar' then biz.name        when 'subject' then subj.name        end
    end source_name,

case when hour(in_time) between 0 and 5 then 1 end 在xx到xx之间

SQL语句中的转换：查表u3的name和sex，如果sex为1，返回“男”；2返回“女”
select 
u.name, 
case u.sex when 1 then "男" when 2 then "女" else "人妖" end
from u3 u;


IF语句:
select if(1=1,"男","女")；
和下面成果一样
select
u.name,
if(u.sex=1,"男",if(u.sex=2,"女","人妖"))
from u3 u;


select concat("1","2","3");  //连贯字符串
select concat_ws("|","1","2","3"); //连贯，带分隔符


select length("dasdadada");  //字符串长度
select size(array(1,2,3));   //数组个数


select current_date         以后工夫：2019-07-11
select current_timestamp      以后工夫：2019-07-11 15:28:49.499

cast('1214234321' as int)
cast('1214234321' as timestamp)

unix_timestamp(date/timestamp);            字符串转工夫戳（数据库中的date、timestrap转毫秒工夫戳）
unix_timestamp(date, format);     字符串转工夫戳
###########
如果starttime在数据库中是以timestamp存储的: 显示2019-12-01 16:04:07.0
咱们应用的时候将starttime转为咱们在hive中可用的unix_timestamp


from_unixtime(date/timestamp)          工夫戳转字符串date
from_unixtime(date/timestamp, format)  工夫戳转字符串date
datediff(enddate, startdate)          取得相差天数，(date ,date) -> int
date_add(startdate, intdays)          取得减少后的工夫 date -> date
date_sub(startdate, intdays)          取得缩小后的工夫 date -> date
date_format(date/timestamp/string, fmt)    工夫格式化 date -> date


select datediff(current_date, '2019-01-01')
select datediff(date_format(current_date, '%Y-%m-%d'), '2019-07-10') //计算日期差


select datediff(MONTH,'2012/8/1','2012/8/20') as daysum  //计算相差月数
select datediff(DAY,'2012/8/1','2012/8/20') as daysum  //计算相差天数
select datediff(HOUR,'2012/8/1','2012/8/20') as daysum  //计算相差小时数
select datediff(MINUTE,'2012/8/1','2012/8/20') as daysum  //计算相差分钟数
select datediff(SECOND,'2012/8/1','2012/8/20') as daysum  //计算相差秒数
//注解：MONTH为月、DAY为查问天、HOUR为小时、MINUTE为分钟、SECOND为秒



FROM_UNIXTIME
================================
timestamp > 工夫字符串
hive> desc function from_unixtime;
from_unixtime(unix_time, format) - returns unix_time in the specified format date
参数：
工夫戳timestamp
返回值格局: 
'%Y-%m-%d %H:%i:%S'
'yyyy-MM-dd hh:ss:mm'
例如: 
mysql>SELECT FROM_UNIXTIME( 1249488000, 'yyyyMMdd' )                          ->20071120 
mysql>SELECT FROM_UNIXTIME( 1249488000, '%Y年%m月%d' )         ->2007年11月20 
from_unixtime(cast(t.update_time as bigint),'yyyy-MM-dd HH:mm:ss')

UNIX_TIMESTAMP
=================================
工夫字符串/date/datetime/timestamp > timestamp
hive> desc function UNIX_TIMESTAMP;
UNIX_TIMESTAMP([date[, pattern]]) - Returns the UNIX timestamp
与from_unixtime绝对正好相同的工夫函数 
若无参数调用，则返回一个 Unix timestamp ('1970-01-01 00:00:00' GMT 之后的秒数) 作为无符号整数。
若用date 来调用 UNIX_TIMESTAMP()，它会将参数值以'1970-01-01 00:00:00' GMT后的秒数的模式返回。date 能够是一个 DATE 字符串、一个 DATETIME字符串、一个 TIMESTAMP或一个当地工夫的YYMMDD 或YYYMMDD格局的数字。 
例如: 
hive> SELECT UNIX_TIMESTAMP() 后果：1249524739 
hive> SELECT UNIX_TIMESTAMP('2009-08-06') 后果：1249488000 
hive> SELECT UNIX_TIMESTAMP( date() ) 后果：1249488000
hive> SELECT UNIX_TIMESTAMP( 1249488000 ) 后果：1249488000


1、如果hive表中是varchar
unix_timestamp("2020-01-01 12:30:54")          -> 1249488000

2、如果hive表中是long
from_unixtime(dt*0.001, 'yyyy-MM-dd hh:ss:mm')   -> 2020-01-01 12:30:54
bigint -> varchar: from_unixtime(visplaylog_datetime/1000) log_datetime

3、如果hive表中是timestamp
unix_timestamp(1249488000)          -> 1249488000 (hive中和unix中的timestamp格局是不一样的，应用前必须先转化)

4、如果hive中的是date
unix_timestamp(date())                -> 1249488000

排序函数：UDF

row_number()：值雷同，名次减少
rank()：值雷同，并列，跳跃1224
desen_rank(): 值雷同，并列，不跳跃1223

数据：

01 gp1802 84
02 gp1801 84
03 gp1802 84
04 gp1802 84
05 gp1801 81
06 gp1802 81
07 gp1802 81
08 gp1801 81
09 gp1802 81
10 gp1802 81
11 gp1803 81
12 gp1802 89
13 gp1802 89
14 gp1802 89
15 gp1803 89
16 gp1802 91
17 gp1802 97
18 gp1802 72

建表

create table t_rank (
sid string,
class string,
score int
)
row format delimited
fields terminated by ' ';


导入数据：
load data local inpath '/home/rank.txt' into table t_rank;

1.5 需要：统计每个班级的每个学生问题（学生问题依照降序排列）

select * from t_rank group by class order by score;            不行的，分组查问要和聚合函数联用

e.g.
select class, sid, score,
row_number() over(distribute by class sort by score desc),  
rank() over(distribute by class sort by score desc),  
dense_rank() over(distribute by class sort by score desc) 
from t_rank;

1.6 需要：统计每个班级问题前三的同学有哪些

e.g.
select a.* from
(
select class, sid, score,
row_number() over(distribute by class sort by score desc) rank
from t_rank
) a
where rank <= 3;

自定义函数：UDF

UDF的evaluate办法的上下文是一次一行。

1、创立Maven我的项目

maven依赖pom.xml:

  <dependencies>
    <dependency>
      <groupId>org.apache.hive</groupId>
      <artifactId>hive-exec</artifactId>
      <version>1.2.1</version>
    </dependency>
      <dependency>
          <groupId>junit</groupId>
          <artifactId>junit</artifactId>
          <version>4.12</version>
          <scope>compile</scope>
      </dependency>
  </dependencies>

2、创立java类，继承UDF

public class Birth2Age extends UDF {
    /**
     * 输出生日，返回年龄
     * 
     * evaluate：办法名称必须是这个，然而能够重载
     *
     * @param birthage ：格局为yyyy-MM-dd (eg.2018-08-08)
     * @return
     */
    public int evaluate(String birthage) throws ParseException {
        if (birthage == null){
            return 0;
        }
        SimpleDateFormat format = new SimpleDateFormat("yyyy-MM-dd");
        Date date_bir = format.parse(birthage);

        Calendar calendar_cur = Calendar.getInstance();
        Date date_cur = new Date();
        calendar_cur.setTimeInMillis(date_cur.getTime());
        int year_cur = calendar_cur.get(Calendar.YEAR);

        Calendar calendar_bir = Calendar.getInstance();
        calendar_bir.setTimeInMillis(date_bir.getTime());
        int year_bir = calendar_bir.get(Calendar.YEAR);
        return year_cur-year_bir;
    }
}

3、maven打包jar

package双击

4、jar包上传到hdfs

FileZilla..

5、将jar包增加在hive中

hive> add jar /home/udf.jar;  门路不加单引号

6、创立长期函数

语法：
CREATE TEMPORARY FUNCTION function_name AS class_name;

创立长期函数：
hive> create temporary function udf_to_upper as 'com.qf.udf.TOUpperCase'

删除长期函数：
DROP TEMPORARY FUNCTION [IF EXISTS] function_name;

7、创立永恒函数

语法：
CREATE FUNCTION [db_name.]function_name AS class_name
  [USING JAR|FILE|ARCHIVE 'file_uri' [, JAR|FILE|ARCHIVE 'file_uri'] ];


1.将jar包上传到hdfs任意地位
2.调用命令
hive> 
CREATE FUNCTION udf_Birth2Age AS 'com.qf.Birth2Age'
USING JAR 'hdfs://y51:9000/udf_Birth2Age.jar';

CREATE FUNCTION udf_JsonSerDe AS 'org.openx.data.jsonserde.JsonSerDe'
USING JAR 'hdfs://y51:9000/home/json-serde-1.3.9-SNAPSHOT-jar-with-dependencies.jar';


删除函数：
hive> DROP FUNCTION udf_Birth2Age;


从新加载函数：
hive> RELOAD FUNCTION;

8、其它形式加载函数

0、配置jar目录
<property>
    <name>hive.aux.jars.path</name>
    <value>$HIVE_HOME/auxlib/</value>
<property>


1、单次应用
hive> add jar /home/udf.jar;  门路不加单引号
hive> CREATE TEMPORARY FUNCTION function_name AS class_name;


2、在任意目录创立hive-init文件，并增加如下内容
    add jar /home/udf.jar;
    CREATE TEMPORARY FUNCTION udf_toUpperCase AS 'cn.qphone.udf.UDF_ToUpperCase';
hive> hive -i hive-init (每次启动hive的时候手动执行一遍)


3、在hive的conf目录下创立.hiverc文件，并增加如下内容（在hive启动时会主动执行）
    add jar /home/udf.jar;
    CREATE TEMPORARY FUNCTION udf_toUpperCase AS 'cn.qphone.udf.UDF_ToUpperCase';


4、永恒加载
参考7、创立永恒函数

侧面视图（爆炸）：Lateral View

https://cwiki.apache.org/conf…(array)

数组:explode (array)：

explode将数组（或地图）作为输出，并将数组（map）的元素作为独自的行输入

数组:explode (array)
select explode(array('A','B','C'));
select explode(array('A','B','C')) as col;
select tf.* from (select 0) t lateral view explode(array('A','B','C')) tf;
select tf.* from (select 0) t lateral view explode(array('A','B','C')) tf as col;

col
A
B
C

汇合: explode (map)

汇合: explode (map)
select explode(map('A',10,'B',20,'C',30));
select explode(map('A',10,'B',20,'C',30)) as (key,value);
select tf.* from (select 0) t lateral view explode(map('A',10,'B',20,'C',30)) tf;
select tf.* from (select 0) t lateral view explode(map('A',10,'B',20,'C',30)) tf as key,value;

key	value
A	10
B	20
C	30

posexplode (array)：

posexplode相似于explode`但它不是仅仅返回数组的元素，而是返回元素以及它在原始数组中的地位。

select posexplode(array('A','B','C'));
select posexplode(array('A','B','C')) as (pos,val);
select tf.* from (select 0) t lateral view posexplode(array('A','B','C')) tf;
select tf.* from (select 0) t lateral view posexplode(array('A','B','C')) tf as pos,val;

pos	val
0	A
1	B
2	C

1.6 分辨别桶

1.1 什么是分区

1、为什么要应用分区？
在查问hive数据时，都是通过扫描目录中所有的文件，随着咱们的文件数量越来越多，这会导致咱们的查问效率越来越低。

2、什么是分区？
mapreduce中的分区是将不同的数据别离写入不同的【文件】中，而hive的分区则是一个目录，每个目录会将寄存不同的数据。（例：将每年的数据别离存储在不同的目录下）

1.2 动态分区

须要手动指定分区列表

1.2.1 创立一级分区表

（1）创立分区表：
1.创立分区的时候必须指定一个分区列（站位列，不是真的列）
==============================================
create table t_part(
id int,
name string,
sex string,
money double
)
partitioned by(location string)
row format delimited
fields terminated by '\t';


create table t_part(
id int,
country string
)
partitioned by(location string)
row format delimited
fields terminated by '\t';


（2）一级分区导入数据:
==============================================
语法：
LOAD DATA [LOCAL] INPATH 'filepath' [OVERWRITE] INTO TABLE tablename [PARTITION (partcol1=val1, partcol2=val2 ...)]
导入数据
load data local inpath '/home/user_china.txt' into table t_part partition(location='china')
load data local inpath '/home/user_japan.txt' into table t_part partition(location='japan')

1.2.2 创立二级分区表

create table t_part2(
id int,
name string,
sex string,
money double
)
partitioned by(year string, month string)
row format delimited
fields terminated by '\t'
stored as textfile;

（2）二级分区导入数据:
==============================================
语法：
LOAD DATA [LOCAL] INPATH 'filepath' [OVERWRITE] INTO TABLE tablename [PARTITION (partcol1=val1, partcol2=val2 ...)]
导入数据
load data local inpath '/home/user-2019-01-12.txt' into table t_part2 partition(year='2019',month='01')
load data local inpath '/home/user-2019-11-11.txt' into table t_part2 partition(year='2019',month='11')
load data local inpath '/home/user-2019-12-12.txt' into table t_part2 partition(year='2019',month='12')

1.3 动静分区

会动静生成分区目录（动静分区会调用mapreduce执行分区操作，会耗费更多资源）

1.3.1 批改hive-site.xml

1、批改hive-site.xml

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<configuration>
  <property>
    <name>javax.jdo.option.ConnectionPassword</name>
    <value>123456</value>
  </property>
  
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:mysql://localhost:3306/hive?createDatabaseIfNotExist=true</value>
  </property>desc
  
  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>com.mysql.jdbc.Driver</value>
  </property>
  
  <property>
    <name>javax.jdo.option.ConnectionUserName</name>
    <value>root</value>
  </property>
  
  
   <!-- 开启hive的本地模式  -->
  <property>
    <name>hive.exec.mode.local.auto</name>
    <value>true</value>
  </property>

 <!-- 开启你以后库的显示 -->
 <property>
    <name>hive.cli.print.current.db</name>
    <value>true</value>
  </property>

 <!-- 开启动静分区  -->、
 <property>
    <name>hive.exec.dynamic.partition</name>
    <value>true</value>
  </property>
 <!-- 开启动静分区的模式  -->
 <property>
    <name>hive.exec.dynamic.partition.mode</name>
    <value>nonstrict</value>
  </property>
  
  
</configuration>

1.3.2 创立分区表

create table t_dynamic_part (
id int,
name string,
sex string,
money double
)
partitioned by(location string)
row format delimited
fields terminated by '\t';

1.3.3 动静分区导入数据

语法：
INSERT INTO TABLE tablename [PARTITION (partcol1[=val1], partcol2[=val2] ...)];

导入数据：
insert into t_dynamic_part  partition(location)
select * from t_part;

二、分桶

2.1 分桶如何晋升性能

论断：防止笛卡尔积，大量缩小运算次数。

table_1:
a    1
b    2
c    3
d    4
e    5
f    6

table_2:
1    xx
2    xx
3    xx
4    xx
5    xx
6    xx

分桶：(规定 n%3)
a    1            b    2            c    3
d    4            e    5            f    6
============================================
1    xx            2    xx            3    xx
4    xx            5    xx            6    xx


个别状况：当table_1和table_2进行join操作时，table_1会和table_2的每个数据进行匹配。
分桶之后：当table_1和table_2进行join操作时，table_1的第一个桶会和table_2的第一个桶进行join
原始计算次数：6*6=36
分桶后计算次数：2*2 + 2*2 + 2*2 = 12

2.2 创立分桶表

创立分桶表：
create table t_buk(
id int,
name string,
sex string,
money double
)
clustered by(id) sorted by(id asc) into 3 buckets
row format delimited
fields terminated by '\t';


导入数据：
1    lixi    man    1000.0
2    lixi    man    1000.0
3    lixi    man    1000.0
4    lixi    man    1000.0
5    lixi    man    1000.0
6    lixi    man    1000.0
7    lixi    man    1000.0
8    lixi    man    1000.0
9    lixi    man    1000.0
10    lixi    man    1000.0
11    lixi    man    1000.0
12    lixi    man    1000.0
insert into t_buk select * from  t_part cluster by id; -- cluster自带全局排序
或
insert into t_buk select * from t_stu distribute by id sort by name asc, age desc; -- distribute可自定义

2.3 分桶查问

Cluster by (在查问数据时，不能应用部分排序sort by)
Distribute by … sort by … (sort部分排序，order by：全局排序)

语法(可用于数据导入)：
SELECT col1, col2 FROM t_stu CLUSTER BY col1;
SELECT col1, col2 FROM t_stu DISTRIBUTE BY col1    SORT BY col1 ASC, col2 DESC;

必须设置reduceTask个数（决定了分桶个数）：
set mapreduce.job.reduces=3;

2.4 抽样查问

tablesample

语法：
TABLESAMPLE (BUCKET x OUT OF y [ON colname])
x：第几桶，从1开始
y：总桶数
col：列


简略抽样：取第几桶数据
select * from t_buk tablesample(bucket 1 out of 3);
select * from t_buk tablesample(bucket 2 out of 3);


随机桶抽样：查问3桶中的第1桶，因为应用了rand（），会随机加桶。
select * from t_buk tablesample(bucket 1 out of 3 on rand());
select * from t_buk tablesample(bucket 2 out of 3 on rand());


固定桶抽样：取第几桶数据（实际效果和简略抽样统一）
select * from t_buk tablesample(bucket 1 out of 3 on id);
select * from t_buk tablesample(bucket 2 out of 3 on id);


压缩桶取样：取第几桶数据（所有桶的数据会依据桶的数量从新进行调配，n%2 ）
select * from t_buk tablesample(bucket 1 out of 2 on id);
select * from t_buk tablesample(bucket 2 out of 2 on id);


拉伸桶取样：取第几桶数据（所有桶的数据会依据桶的数量从新进行调配，n%4 ）
select * from t_buk tablesample(bucket 1 out of 4 on id);
select * from t_buk tablesample(bucket 2 out of 4 on id);


百分比取样：
select * from t_buk tablesample(10 percent);    // 12*0.1 = 1.2，有余数则 +1
select * from t_buk tablesample(20 percent);    // 12*0.2 = 2.4,


字节长度取样：
ByteLengthLiteral : (Digit)+ ('b' | 'B' | 'k' | 'K' | 'm' | 'M' | 'g' | 'G')
select * from t_buk tablesample(10b);

三、分辨别桶联合

1.创立原始表(长期表)
分区列要放在最初地位
===========================================
create TEMPORARY table t_com(
id int,
name string,
money double,
sex int
)
row format delimited
fields terminated by '\t'
stored as textfile;


2、加载数据
===========================================
1    lixi    1000.0    1
2    lixi    1000.0    1
3    lixi    1000.0    1
4    lixi    1000.0    1
5    lixi    1000.0    1
6    lixi    1000.0    1
7    lixi    1000.0    1
8    lixi    1000.0    1
9    lixi    1000.0    1
10    lixi    1000.0    0
11    lixi    1000.0    0
12    lixi    1000.0    0
load data local inpath '/home/log.txt' into table t_com;


3、创立分桶分区表
===========================================
- 以sex性别分区（分区列是伪列，不写在外面）
- 分4桶
create table t_part_buk(
id int,
name string,
money double
)
partitioned by(sex int)
clustered by(id) sorted by(id asc) into 4 buckets
row format delimited
fields terminated by '\t'
stored as textfile;




create table t_buk(
id int,
country string
)
partitioned by(location string)
clustered by(id) sorted by(id asc) into 3 buckets
row format delimited
fields terminated by '\t'
stored as textfile;


4、导入数据
tip: 必须设置reduceTask个数（决定了分桶个数）：set mapreduce.job.reduces=4;
===========================================
insert into t_part_buk partition(sex)
select id,country from t_com distribute by id sort by id asc;


5、查问后果：
===========================================
> select * from t_part_buk;
OK
12      lixi    1000.0  0
10      lixi    1000.0  0
11      lixi    1000.0  0
4       lixi    1000.0  1
8       lixi    1000.0  1
1       lixi    1000.0  1
5       lixi    1000.0  1
9       lixi    1000.0  1
2       lixi    1000.0  1
6       lixi    1000.0  1
3       lixi    1000.0  1
7       lixi    1000.0  1
每个区都会有4个桶

1.7 存储格局

Hive数据存储格局

https://cwiki.apache.org/conf…

Storage Formats

textfile：默认的存储格局；一般文本文件，数据不会压缩
sequencefile：hive提供的一种二进制存储格局，自带压缩，能够切割
rcfile：一种行列混合存储格局，会将相近的行与列放在一起，存储耗时，查问较快，自带压缩。
orc：是rcfile的一种优化
parquet：自定义输入输出格局

serde记录格局

1、常见的serde

ser：serialiler序列化
de：deserialilzer反序列化
csv、tsv、json serde、regex、parqut

2、创立csv格局的表

创立csv表：
create table csv1(
name string,
score string
)
row format serde 'org.apache.hive.serde2.OpenCSVSerde'
with serdeproperties(
"separatorChar"="|",
"quoteChar"="'",
"esacpeChar"="\\"
)
stored as textfile
;


导入数据：
LOAD DATA LOCAL INPATH '/HOME/CSV.txt' INTO TABLE csv1;

3、json serde

1.增加第三方jar包
hive> add jar /home/json-serde-1.3-jar-with-dependencies.jar;


2.创立表
create table json1(
name string,
score array<map<string,string>>
)
row format serde 'org.openx.data.jsonserde.JsonSerDe'
stored as textfile
;


3.导入数据
LOAD DATA LOCAL INPATH '/home/json.txt' INTO TABLE json1;

4、regex serde

创立表：
create table regex1(
host string,
dt string
)
row format serde 'org.apache.hadoop.hive.serde2.RegexSerDe'
with serdeproperties(
    "input.regex"="^([0-9]{1,3}.[0-9]{1,3}.[0-9]{2,3} (.*)$)"
)
stored as textfile
;


导入数据：
192.168.120.100 [31/Jan/2012:00:02:02 +0800]
192.168.120.100 [31/Jan/2012:00:02:02 +0800]
192.168.120.100 [31/Jan/2012:00:02:02 +0800]
LOAD DATA LOCAL INPATH '/home/regex.txt' INTO TABLE regex1;

1.8、显示

Show Databases【显示所有数据库】
Show Tables/Views/Partitions/Indexes
        Show Tables【显示所有表】
        Show Views【显示所有视图】
        Show Partitions【显示分区】
        Show Create Table【显示建表语句】
        Show Indexes【显示索引】
Show Functions【显示所有函数】

1.8.1 显示所有数据库

SHOW DATABASES

1.8.2 显示所有表

SHOW TABLES

1.8.3 显示表所有视图

SHOW VIEWS;                                -- show all views in the current database
SHOW VIEWS 'test_*';                       -- show all views that start with "test_"
SHOW VIEWS '*view2';                       -- show all views that end in "view2"
SHOW VIEWS LIKE 'test_view1|test_view2';   -- show views named either "test_view1" or "test_view2"
SHOW VIEWS FROM test1;                     -- show views from database test1
SHOW VIEWS IN test1;                       -- show views from database test1 (FROM and IN are same)
SHOW VIEWS IN test1 "test_*";              -- show views from database test2 that start with "test_"

1.8.4 显示表所有分区

SHOW PARTITIONS table_name;

1.8.5 显示表DDL语句

SHOW CREATE TABLE ([db_name.]table_name|view_name);

1.8.6 显示表所有索引

SHOW [FORMATTED] (INDEX|INDEXES) ON table_with_index [(FROM|IN) db_name];

1.8.7 显示所有函数

SHOW FUNCTIONS "a.*";

1.9 压缩

在Hive中对两头数据或最终数据做压缩，是进步数据吞吐量和性能的一种伎俩。对数据做压缩，能够大量缩小磁盘的存储空间，比方基于文本的数据文件，能够将文件压缩40%或更多。同时压缩后的文件在磁盘间传输和I/O也会大大减少；当然压缩和解压缩也会带来额定的CPU开销，然而却能够节俭更多的I/O和应用更少的内存开销。

1.9.1 压缩模式评估

评估规范：

压缩比：压缩比越高，压缩后文件越小，所以压缩比越高越好。
压缩工夫：越快越好。
曾经压缩的格式文件是否能够再宰割：能够宰割的格局容许繁多文件由多个Mapper程序处理，能够更好的并行化。

1.9.2 压缩模式比照

BZip2有最高的压缩比但也会带来更高的CPU开销，Gzip较BZip2次之。如果基于磁盘利用率和I/O思考，这两个压缩算法都是比拟有吸引力的算法。
LZO和Snappy算法有更快的解压缩速度，如果更关注压缩、解压速度，它们都是不错的抉择。 LZO和Snappy在压缩数据上的速度大抵相当，但Snappy算法在解压速度上要较LZO更快。
Hadoop的会将大文件宰割成HDFS block(默认64MB)大小的splits分片，每个分片对应一个Mapper程序。在这几个压缩算法中 BZip2、LZO、Snappy压缩是可宰割的，Gzip则不反对宰割。

1.9.3 常见压缩格局

压缩形式	压缩后大小	压缩速度	是否可宰割
GZIP	中	中	否
BZIP2	小	慢	是
LZO	大	快	是
Snappy	大	快	是

1.9.4 Hadoop编码解码

压缩格局	对应的编码/解码器
DEFAULT	org.apache.hadoop.io.compress.DefaultCodec
Gzip	org.apache.hadoop.io.compress.GzipCodec
Bzip	org.apache.hadoop.io.compress.BzipCodec
Snappy	org.apache.hadoop.io.compress.SnappyCodec（两头压缩应用）
Lzo	com.hadoop.compression.lzo.LzoCodec（两头压缩应用）

1.9.5 什么是宰割

在思考如何压缩那些将由MapReduce解决的数据时，思考压缩格局是否反对宰割是很重要的。思考存储在HDFS中的未压缩的文件，其大小为1GB，HDFS的块大小为64MB，所以该文件将被存储为16块，将此文件用作输出的MapReduce作业会创立1个输人分片（split，也称为“分块”。对于block，咱们对立称为“块”。）每个分片都被作为一个独立map工作的输出独自进行解决。
当初假如，该文件是一个gzip格局的压缩文件，压缩后的大小为1GB。和后面一样，HDFS将此文件存储为16块。然而，针对每一块创立一个分块是没有用的，因为不可能从gzip数据流中的任意点开始读取，map工作也不可能独立于其余分块只读取一个分块中的数据。gzip格局应用DEFLATE来存储压缩过的数据，DEFLATE将数据作为一系列压缩过的块进行存储。问题是，每块的开始没有指定用户在数据流中任意点定位到下一个块的起始地位，而是其本身与数据流同步。因而，gzip不反对宰割(块)机制。
在这种状况下，MapReduce不宰割gzip格局的文件，因为它晓得输出是gzip压缩格局的(通过文件扩展名得悉)，而gzip压缩机制不反对宰割机制。因而一个map工作将解决16个HDFS块，且大都不是map的本地数据。与此同时，因为map工作少，所以作业宰割的粒度不够细，从而导致运行工夫变长。

1.9.6 两头压缩

hive.exec.compress.intermediate：默认该值为false，设置为true为激活两头数据压缩性能。HiveQL语句最终会被编译成Hadoop的Mapreduce job，开启Hive的两头数据压缩性能，就是在MapReduce的shuffle阶段对mapper产生的两头后果数据压缩。在这个阶段，优先选择一个低CPU开销的算法。
mapred.map.output.compression.codec：该参数是具体的压缩算法的配置参数，SnappyCodec比拟适宜在这种场景中编解码器，该算法会带来很好的压缩性能和较低的CPU开销。设置如下：

set hive.exec.compress.intermediate=true
set mapred.map.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec
set mapred.map.output.compression.codec=com.hadoop.compression.lzo.LzoCodec;

1.9.7 最终压缩

hive.exec.compress.output：用户能够对最终生成的Hive表的数据通常也须要压缩。该参数管制这一性能的激活与禁用，设置为true来申明将后果文件进行压缩。
mapred.output.compression.codec：将hive.exec.compress.output参数设置成true后，而后抉择一个适合的编解码器，如抉择SnappyCodec。设置如下：

set hive.exec.compress.output=true 
set mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec

1.9.8 TETFILE应用压缩

Hive数据表的默认格局，存储形式：行存储。
能够应用Gzip压缩算法，但压缩后的文件不反对split
在反序列化过程中，必须一一字符判断是不是分隔符和行结束符，因而反序列化开销会比SequenceFile高几十倍。

建表语句：

${建表语句}
stored as textfile;
##########################################插入数据########################################
set hive.exec.compress.output=true; --启用压缩格局
set mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;  --指定输入的压缩格局为Gzip 
set mapred.output.compress=true;   
set io.compression.codecs=org.apache.hadoop.io.compress.GzipCodec;     
insert overwrite table textfile_table select * from T_Name;

1.9.9 SEQUENCEFILE应用压缩

压缩数据文件能够节俭磁盘空间，但Hadoop中有些原生压缩文件的毛病之一就是不反对宰割。反对宰割的文件能够并行的有多个mapper程序处理大数据文件，大多数文件不反对可宰割是因为这些文件只能从头开始读。Sequence File是可宰割的文件格式，反对Hadoop的block级压缩。
Hadoop API提供的一种二进制文件，以key-value的模式序列化到文件中。存储形式：行存储。
sequencefile反对三种压缩抉择：NONE，RECORD，BLOCK。Record压缩率低，RECORD是默认选项，通常BLOCK会带来较RECORD更好的压缩性能。
劣势是文件和hadoop api中的MapFile是互相兼容的

建表代码

${建表语句}
SORTED AS SEQUENCEFILE;    --将Hive表存储定义成SEQUENCEFILE
##########################################插入数据########################################
set hive.exec.compress.output=true; --启用压缩格局
set mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec; --指定输入的压缩格局为Gzip 
set mapred.output.compression.type=BLOCK;   --压缩选项设置为BLOCK
set mapred.output.compress=true; 
set io.compression.codecs=org.apache.hadoop.io.compress.GzipCodec; 
insert overwrite table textfile_table select * from T_Name;

1.9.10 RCFILE应用压缩

存储形式：数据按行分块，每块按列存储。联合了行存储和列存储的长处：

首先，RCFile 保障同一行的数据位于同一节点，因而元组重构的开销很低
其次，像列存储一样，RCFile 可能利用列维度的数据压缩，并且能跳过不必要的列读取
数据追加：RCFile不反对任意形式的数据写操作，仅提供一种追加接口，这是因为底层的 HDFS以后仅仅反对数据追加写文件尾部。
行组大小：行组变大有助于进步数据压缩的效率，然而可能会侵害数据的读取性能，因为这样减少了 Lazy 解压性能的耗费。而且行组变大会占用更多的内存，这会影响并发执行的其余MR作业。思考到存储空间和查问效率两个方面，Facebook 抉择 4MB 作为默认的行组大小，当然也容许用户自行抉择参数进行配置。

建表代码

${建表语句}
stored as rcfile;
-插入数据操作：
set hive.exec.compress.output=true; 
set mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec; 
set mapred.output.compress=true; 
set io.compression.codecs=org.apache.hadoop.io.compress.GzipCodec; 
insert overwrite table rcfile_table select * from T_Name;

1.9.11 ORC应用压缩

存储形式：数据按行分块，每块依照列存储。
压缩快，疾速列存取。效率比rcfile高，是rcfile的改进版本。

1.9.12 压缩小结

TextFile默认格局，加载速度最快，能够采纳Gzip进行压缩，压缩后的文件无奈split，即并行处理。
SequenceFile压缩率最低，查问速度个别，将数据寄存到sequenceFile格局的hive表中，这时数据就会压缩存储。三种压缩格局NONE，RECORD，BLOCK。是可宰割的文件格式。
RCfile压缩率最高，查问速度最快，数据加载最慢。
相比TEXTFILE和SEQUENCEFILE，RCFILE因为列式存储形式，数据加载时性能耗费较大，然而具备较好的压缩比和查问响应。数据仓库的特点是一次写入、屡次读取，因而，整体来看，RCFILE相比其余两种格局具备较显著的劣势。
在hive中应用压缩须要灵便的形式，如果是数据源的话，采纳RCFile+bz或RCFile+gz的形式，这样能够很大水平上节俭磁盘空间；而在计算的过程中，为了不影响执行的速度，能够节约一点磁盘空间，倡议采纳RCFile+snappy的形式，这样能够整体晋升hive的执行速度。至于lzo的形式，也能够在计算过程中应用，只不过综合思考（速度和压缩比）还是思考snappy合适。

https://blog.csdn.net/xsdxs/a…

压缩 compression

1.9.1 map阶段的压缩

set hive.exec.compress.output=false
set hive.exec.compress.intermediate=false
set hive.intermediate.compression.codec
set hive.intermediate.compression.type

1.9.2 reduce阶段的压缩

set hive.exec.compress.output=false
set hive.exec.compress.intermediate=false
set hive.intermediate.compression.codec
set hive.intermediate.compression.type

1.9.3 自定义压缩

INPUTFORMAT and OUTPUTFORMAT

in the file_format to specify the name of a corresponding InputFormat and OutputFormat class as a string literal.

For example, 'org.apache.hadoop.hive.contrib.fileformat.base64.Base64TextInputFormat'. 

For LZO compression, the values to use are 
'INPUTFORMAT "com.hadoop.mapred.DeprecatedLzoTextInputFormat" 
OUTPUTFORMAT "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat"' 

(see LZO Compression).


在file_format中指定相应的InputFormat和OutputFormat类的名称作为字符串文本。
例如,“org.apache.hadoop.hive.contrib.fileformat.base64.Base64TextInputFormat”。
对于LZO压缩，要应用的值是
INPUTFORMAT“com.hadoop.mapred.DeprecatedLzoTextInputFormat”
OUTPUTFORMAT“org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat”
(见LZO压缩)。

1.9.4 压缩比例

压缩算法	TEXTFILE	SEQUENCFILE	RCFILE	ORC
不压缩	119.2G	54.1G	20.0G	98G
snappy	30.2G	23.6G	13.6G	27.0G
Gzip	18.8G	14.1G	不反对	15.2G
ZLIB	不反对	不反对	10.1G	不反对

1.10 窗口函数

1.10.1 SUM OVER

现有 hive 表 cookie1, 内容如下:
hive> select * from cookie1;
cookie1 2015-04-10  1
cookie1 2015-04-11  5
cookie1 2015-04-12  7
cookie1 2015-04-13  3
cookie1 2015-04-14  2
cookie1 2015-04-15  4
cookie1 2015-04-16  4

（1）对某个字段全副数据进行累加例如，求cookie1的总pv值

partition by cookieid : 依照cookieid分组

select cookieid,createtime,pv, 
sum(pv) over(partition by cookieid) as sum1 
from cookie1; 

后果:
cookie1 2015-04-16  4   26
cookie1 2015-04-15  4   26
cookie1 2015-04-14  2   26
cookie1 2015-04-13  3   26
cookie1 2015-04-12  7   26
cookie1 2015-04-11  5   26
cookie1 2015-04-10  1   26

(2) 从第一行开始累加到以后行

select cookieid, createtime, pv, 
sum(pv) over(partition by cookieid order by createtime) as sum2
from cookie1;

或：以上SQL语句是上面这个SQL语句的简写:
select cookieid,createtime,pv, 
sum(pv) over(partition by cookieid order by createtime rows between unbounded preceding and current row) as sum3
from cookie1;

# 后果
cookie1 2015-04-10  1   1
cookie1 2015-04-11  5   6
cookie1 2015-04-12  7   13
cookie1 2015-04-13  3   16
cookie1 2015-04-14  2   18
cookie1 2015-04-15  4   22
cookie1 2015-04-16  4   26

根本的语法是：

unbounded : 有限
preceding : 往前
following : 往后
between...end... : 指定操作的范畴
current row : 以后行，计算到哪一行就是哪一行
between unbounded preceding and current row : 从终点到以后行，往前(对应表中的从上到下的程序)累加
默认就是从终点到以后行往前累加，所以between unbounded preceding and current row这个条件能够不写。

##################################################

ROWS BETWEEN  1 AND 2 （从行1到行2）
ROWS BETWEEN  2 preceding AND 3 following （前2行 到 后3行）
ROWS BETWEEN  2 preceding AND current row （前2行 到 以后行）
ROWS BETWEEN  unbounded preceding AND current row （第一行 到 以后行）
ROWS BETWEEN  current row AND unbounded following （以后行 到 最初一行）

工夫点别离能够是以以后行作为参考系，后面几行n PRECEDING或者是前面几行n FOLLOWING，也能够是以后行CURRENT ROW。总之能够设想有一个滑动窗口，咱们能够规定一个滑动窗口的核心地位和大小，而后每次画过一个步长，计算一次窗口内的值。

(3) 计算分组内以后行到残余所有行的和

select cookieid, createtime, pv, 
sum(pv) over(partition by cookieid order by createtime rows between current row and unbounded following) as sum4
from cookie1; 

后果剖析 : 
cookie1 2015-04-10  1   26  (1+5+7+3+2+4+4=26)
cookie1 2015-04-11  5   25  (5+7+3+2+4+4=25)
cookie1 2015-04-12  7   20  (7+3+2+4+4=20)
cookie1 2015-04-13  3   13  (3+2+4+4=13)
cookie1 2015-04-14  2   10  (2+4+4=10)
cookie1 2015-04-15  4   8   (4+4=8)
cookie1 2015-04-16  4   4   (4)

(4) 指定计算的范畴

select cookieid, createtime, pv, 
sum(pv) over(partition by cookieid order by createtime rows between 3 preceding and current row) as sum5
from cookie1; 

后果剖析 : 
cookie1 2015-04-10  1   1   (这一行后面没有记录，只能取自身的值)
cookie1 2015-04-11  5   6   (这一行后面有1，不够取3行的值，只能取一行与自身相加，取5+1=6)
cookie1 2015-04-12  7   13  (这一行后面有1，5，不够取3行的值，只能取两行与自身相加，取7+5+1=13)
cookie1 2015-04-13  3   16  (这一行后面有3行能够取，顺次往回取三行，取3+7+5+1=16)
cookie1 2015-04-14  2   17  (同理，取2+3+7+5=17)
cookie1 2015-04-15  4   16  (同理，取4+2+3+7=16)
cookie1 2015-04-16  4   13  (同理，取4+4+2+3=13)

select cookieid, createtime, pv, 
sum(pv) over(partition by cookieid order by createtime rows between 3 preceding and 1 following) as sum6
from cookie1; 

后果剖析 : 
cookie1 2015-04-10  1   6   (这一行后面没有记录，前面能够取一条记录，1+5=6)
cookie1 2015-04-11  5   13  (这一行后面只能取一条记录，前面能够取一条记录，1+5+7=13)
cookie1 2015-04-12  7   16  (这一行后面只能取两条记录，前面能够取一条记录，1+5+7+3=16)
cookie1 2015-04-13  3   18  (这一行后面能够取三条记录，前面能够取一条记录，1+5+7+3+2=18)
cookie1 2015-04-14  2   21  (同理，5+7+3+2+4=21) 
cookie1 2015-04-15  4   20  (同理，7+3+2+4+4=20)
cookie1 2015-04-16  4   13  (这一行后面能够取三条记录，前面没有记录，3+2+4+4=13)

1.10.2 MIN OVER

sum、avg、max、min这些窗口函数的语法都是一样的，以下用一个SQL语句来演示 :

select cookieid, createtime, pv,
min(pv) over(partition by cookieid) as min1,
min(pv) over(partition by cookieid order by createtime) as min2, 
min(pv) over(partition by cookieid order by createtime rows between unbounded preceding and current row) as min3,
min(pv) over(partition by cookieid order by createtime rows between current row and unbounded following) as min4,
min(pv) over(partition by cookieid order by createtime rows between 3 preceding and current row) as min5,
min(pv) over(partition by cookieid order by createtime rows between 3 preceding and 1 following) as min6
from cookie1;

后果 : 
cookie1 2015-04-10  1   1   1   1   1   1   1
cookie1 2015-04-11  5   1   1   1   2   1   1
cookie1 2015-04-12  7   1   1   1   2   1   1
cookie1 2015-04-13  3   1   1   1   2   1   1
cookie1 2015-04-14  2   1   1   1   2   2   2
cookie1 2015-04-15  4   1   1   1   4   2   2
cookie1 2015-04-16  4   1   1   1   4   2   2

1.10.3 MAX OVER

select cookieid,createtime,pv,
max(pv) over(partition by cookieid) as max1,
max(pv) over(partition by cookieid order by createtime) as max2,
max(pv) over(partition by cookieid order by createtime rows between unbounded preceding and current row) as max3,
max(pv) over(partition by cookieid order by createtime rows between current row and unbounded following) as max4,
max(pv) over(partition by cookieid order by createtime rows between 3 preceding and current row) as max5,
max(pv) over(partition by cookieid order by createtime rows between 3 preceding and 1 following) as max6
from cookie1;

后果 : 
cookie1 2015-04-10  1   7   1   1   7   1   5
cookie1 2015-04-11  5   7   5   5   7   5   7
cookie1 2015-04-12  7   7   7   7   7   7   7
cookie1 2015-04-13  3   7   7   7   4   7   7
cookie1 2015-04-14  2   7   7   7   4   7   7
cookie1 2015-04-15  4   7   7   7   4   7   7
cookie1 2015-04-16  4   7   7   7   4   4   4

1.10.4 AVG OVER

select cookieid,createtime,pv,
avg(pv) over(partition by cookieid) as avg1,
avg(pv) over(partition by cookieid order by createtime) as avg2,
avg(pv) over(partition by cookieid order by createtime rows between unbounded preceding and current row) as avg3,
avg(pv) over(partition by cookieid order by createtime rows between current row and unbounded following) as avg4,
avg(pv) over(partition by cookieid order by createtime rows between 3 preceding and current row) as avg5,                
avg(pv) over(partition by cookieid order by createtime rows between 3 preceding and 1 following) as avg6
from cookie1;

后果 : 
cookie1 2015-04-10  1       3.7142857142857144      1.0                 1.0                 3.7142857142857144      1.0                 3.0
cookie1 2015-04-11  5       3.7142857142857144      3.0                 3.0                 4.166666666666667       3.0                 4.333333333333333
cookie1 2015-04-12  7       3.7142857142857144      4.333333333333333   4.333333333333333   4.0                     4.333333333333333   4.0
cookie1 2015-04-13  3       3.7142857142857144      4.0                 4.0                 3.25                    4.0                 3.6
cookie1 2015-04-14  2       3.7142857142857144      3.6                 3.6                 3.3333333333333335      4.25                4.2
cookie1 2015-04-15  4       3.7142857142857144      3.6666666666666665  3.6666666666666665  4.0                     4.0                 4.0
cookie1 2015-04-16  4       3.7142857142857144      3.7142857142857144  3.7142857142857144  4.0                     3.25                3.25

1.10.5 ROW_NUMBER OVER

row_number()：值雷同，名次减少

1.10.6 RANK OVER

rank()：值雷同，并列，跳跃2225

1.10.7 DENSE_RANK OVER

desen_rank(): 值雷同，并列，不跳跃2223

1.10.8 LAG OVER

这两个函数为罕用的窗口函数,能够返回高低数据行的数据.

以咱们的订单表为例,如果咱们想要查看顾客上次的购买工夫能够这样去查问

SQL
=================================================
select name,orderdate,cost,
lag(orderdate,1,'1900-01-01') over(partition by name order by orderdate ) as time1,
lag(orderdate,2) over (partition by name order by orderdate) as time2
from t_window;

查问后的数据为:
=================================================
name    orderdate   cost    time1   time2
jack    2015-01-01  10  1900-01-01  NULL
jack    2015-01-05  46  2015-01-01  NULL
jack    2015-01-08  55  2015-01-05  2015-01-01

jack    2015-02-03  23  2015-01-08  2015-01-05
jack    2015-04-06  42  2015-02-03  2015-01-08
mart    2015-04-08  62  1900-01-01  NULL
mart    2015-04-09  68  2015-04-08  NULL
mart    2015-04-11  75  2015-04-09  2015-04-08
mart    2015-04-13  94  2015-04-11  2015-04-09

总结：
=================================================
time1取的为依照name进行分组,分组内升序排列,取上一行数据的值.
time2取的为依照name进行分组，分组内升序排列,取下面2行的数据的值,留神当lag函数为设置行数值时,默认为1行.未设定取不到时的默认值时,取null值.

lead函数与lag函数方向相同,取向下的数据.

1.10.9 LEAD OVER

lead函数与lag函数方向相同,取向下的数据.

1.10.10 FIRST_VALUE OVER

first_value取分组内排序后，截止到以后行，第一个值
last_value取分组内排序后，截止到以后行，最初一个值

SQL
================================================
select name,orderdate,cost,
first_value(orderdate) over(partition by name order by orderdate) as time1,
last_value(orderdate) over(partition by name order by orderdate) as time2
from t_window

查问后果如下:
================================================
name    orderdate   cost    time1   time2
jack    2015-01-01  10  2015-01-01  2015-01-01
jack    2015-01-05  46  2015-01-01  2015-01-05
jack    2015-01-08  55  2015-01-01  2015-01-08
jack    2015-02-03  23  2015-01-01  2015-02-03
jack    2015-04-06  42  2015-01-01  2015-04-06
mart    2015-04-08  62  2015-04-08  2015-04-08
mart    2015-04-09  68  2015-04-08  2015-04-09
mart    2015-04-11  75  2015-04-08  2015-04-11
mart    2015-04-13  94  2015-04-08  2015-04-13

1.10.11 LAST_VALUE OVER

first_value取分组内排序后，截止到以后行，第一个值
last_value取分组内排序后，截止到以后行，最初一个值

SQL
================================================
select name,orderdate,cost,
first_value(orderdate) over(partition by name order by orderdate) as time1,
last_value(orderdate) over(partition by name order by orderdate) as time2
from t_window

查问后果如下:
================================================
name    orderdate   cost    time1   time2
jack    2015-01-01  10  2015-01-01  2015-01-01
jack    2015-01-05  46  2015-01-01  2015-01-05
jack    2015-01-08  55  2015-01-01  2015-01-08
jack    2015-02-03  23  2015-01-01  2015-02-03
jack    2015-04-06  42  2015-01-01  2015-04-06
mart    2015-04-08  62  2015-04-08  2015-04-08
mart    2015-04-09  68  2015-04-08  2015-04-09
mart    2015-04-11  75  2015-04-08  2015-04-11
mart    2015-04-13  94  2015-04-08  2015-04-13

1.10.12 GROUPING SET, CUBE, ROLL UP

待补充

+-----------+------------+-----------+------------+
|data_source|play_type_id|pay_type_id|GROUPING__ID|
+-----------+------------+-----------+------------+
|  playproxy|           2|         10|           0|
|        VIS|           1|         10|           0|
|  playproxy|           1|         10|           0|
|        VIS|           3|         10|           0|
|        VIS|           1|         10|           0|
|        IPM|           1|         10|           0|
|  playproxy|           1|         10|           0|
|        VIS|           1|       null|           2|
|        VIS|           1|       null|           2|
|  playproxy|           2|       null|           2|
|        VIS|           3|       null|           2|
|  playproxy|           1|       null|           2|
|  playproxy|           1|       null|           2|
|        IPM|           1|       null|           2|
|        IPM|        null|         10|           4|
|  playproxy|        null|         10|           4|
|        VIS|        null|         10|           4|
|        VIS|        null|         10|           4|
|  playproxy|        null|         10|           4|
|        IPM|        null|       null|           6|
|  playproxy|        null|       null|           6|
|  playproxy|        null|       null|           6|
|        VIS|        null|       null|           6|
|        VIS|        null|       null|           6|
|       null|           1|         10|           8|
|       null|           3|         10|           8|
|       null|           1|         10|           8|
|       null|           2|         10|           8|
|       null|           1|         10|           8|
|       null|           2|       null|          10|
|       null|           3|       null|          10|
|       null|           1|       null|          10|
|       null|           1|       null|          10|
|       null|           1|       null|          10|
|       null|        null|         10|          12|
|       null|        null|         10|          12|
|       null|        null|         10|          12|
|       null|        null|       null|          14|
|       null|        null|       null|          14|
|       null|        null|       null|          14|
+-----------+------------+-----------+------------+




1    2    3    x
1    2        x
1    3        x
1            x
    2    3    x
    2        x
        3    x
            x



1    2    x
1        x
    2    x
        x

1.11 HIVE优化

1.11.1 设置map和reducer个数

一、管制hive工作中的map数:

通常状况下，作业会通过input的目录产生一个或者多个map工作。
次要的决定因素有： input的文件总个数，input的文件大小，集群设置的文件块大小(目前为128M, 可在hive中通过set dfs.block.size;命令查看到，该参数不能自定义批改)；
举例：
a) 假如input目录下有1个文件a,大小为780M,那么hadoop会将该文件a分隔成7个块（6个128m的块和1个12m的块），从而产生7个map数
b) 假如input目录下有3个文件a,b,c,大小别离为10m，20m，130m，那么hadoop会分隔成4个块（10m,20m,128m,2m）,从而产生4个map数
即，如果文件大于块大小(128m),那么会拆分，如果小于块大小，则把该文件当成一个块。
是不是map数越多越好？
答案是否定的。如果一个工作有很多小文件（远远小于块大小128m）,则每个小文件也会被当做一个块，用一个map工作来实现，
而一个map工作启动和初始化的工夫远远大于逻辑解决的工夫，就会造成很大的资源节约。
而且，同时可执行的map数是受限的。
是不是保障每个map解决靠近128m的文件块，就居安思危了？
答案也是不肯定。比方有一个127m的文件，失常会用一个map去实现，但这个文件只有一个或者两个小字段，却有几千万的记录，
如果map解决的逻辑比较复杂，用一个map工作去做，必定也比拟耗时。

针对下面的问题3和4，咱们须要采取两种形式来解决：即缩小map数和减少map数；

如何合并小文件，缩小map数？

假如一个SQL工作：

Select count(1) from popt_tbaccountcopy_mes where pt = ‘2012-07-04’;

该工作的inputdir /group/p_sdo_data/p_sdo_data_etl/pt/popt_tbaccountcopy_mes/pt=2012-07-04
共有194个文件，其中很多是远远小于128m的小文件，总大小9G，失常执行会用194个map工作。
Map总共耗费的计算资源： SLOTS_MILLIS_MAPS= 623,020

我通过以下办法来在map执行前合并小文件，缩小map数：
set mapred.max.split.size=100000000;
set mapred.min.split.size.per.node=100000000;
set mapred.min.split.size.per.rack=100000000;
set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;

再执行下面的语句，用了74个map工作，map耗费的计算资源：SLOTS_MILLIS_MAPS= 333,500
对于这个简略SQL工作，执行工夫上可能差不多，但节俭了一半的计算资源。
大略解释一下，100000000示意100M, set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;这个参数示意执行前进行小文件合并，
后面三个参数确定合并文件块的大小，大于文件块大小128m的，依照128m来分隔，小于128m,大于100m的，依照100m来分隔，把那些小于100m的（包含小文件和分隔大文件剩下的），
进行合并,最终生成了74个块。

如何适当的减少map数

当input的文件都很大，工作逻辑简单，map执行十分慢的时候，能够思考减少Map数，来使得每个map解决的数据量缩小，从而进步工作的执行效率。
假如有这样一个工作：
Select data_desc,
count(1),
count(distinct id),
sum(case when …),
sum(case when …),
sum(…)
from a group by data_desc
如果表a只有一个文件，大小为120M，但蕴含几千万的记录，如果用1个map去实现这个工作，必定是比拟耗时的，这种状况下，咱们要思考将这一个文件正当的拆分成多个，
这样就能够用多个map工作去实现。
set mapred.reduce.tasks=10;
create table a_1 as
select * from a
distribute by rand(123);

这样会将a表的记录，随机的扩散到蕴含10个文件的a_1表中，再用a_1代替下面sql中的a表，则会用10个map工作去实现。
每个map工作解决大于12M（几百万记录）的数据，效率必定会好很多。

看上去，貌似这两种有些矛盾，一个是要合并小文件，一个是要把大文件拆成小文件，这点正是重点须要关注的中央，
依据理论状况，管制map数量须要遵循两个准则：使大数据量利用适合的map数；使单个map工作解决适合的数据量；

1.11.2 管制hive工作的reduce数：

Hive本人如何确定reduce数：
reduce个数的设定极大影响工作执行效率，不指定reduce个数的状况下，Hive会猜想确定一个reduce个数，基于以下两个设定：
hive.exec.reducers.bytes.per.reducer（每个reduce工作解决的数据量，默认为1000^3=1G）
hive.exec.reducers.max（每个工作最大的reduce数，默认为999）
计算reducer数的公式很简略N=min(参数2，总输出数据量/参数1)
即，如果reduce的输出（map的输入）总大小不超过1G,那么只会有一个reduce工作；
如：select pt,count(1) from popt_tbaccountcopy_mes where pt = ‘2012-07-04’ group by pt;
```
     /group/p_sdo_data/p_sdo_data_etl/pt/popt_tbaccountcopy_mes/pt=2012-07-04 总大小为9G多，因而这句有10个reduce
```
调整reduce个数办法一：
调整hive.exec.reducers.bytes.per.reducer参数的值；
set hive.exec.reducers.bytes.per.reducer=500000000; （500M）
select pt,count(1) from popt_tbaccountcopy_mes where pt = ‘2012-07-04’ group by pt; 这次有20个reduce
调整reduce个数办法二；
set mapred.reduce.tasks = 15;
select pt,count(1) from popt_tbaccountcopy_mes where pt = ‘2012-07-04’ group by pt;这次有15个reduce
reduce个数并不是越多越好；
同map一样，启动和初始化reduce也会耗费工夫和资源；
另外，有多少个reduce,就会有多少个输入文件，如果生成了很多个小文件，那么如果这些小文件作为下一个工作的输出，则也会呈现小文件过多的问题；
什么状况下只有一个reduce；
很多时候你会发现工作中不论数据量多大，不论你有没有设置调整reduce个数的参数，工作中始终都只有一个reduce工作；
其实只有一个reduce工作的状况，除了数据量小于hive.exec.reducers.bytes.per.reducer参数值的状况外，还有以下起因：
a) 没有group by的汇总，比方把select pt,count(1) from popt_tbaccountcopy_mes where pt = ‘2012-07-04’ group by pt; 写成 select count(1) from popt_tbaccountcopy_mes where pt = ‘2012-07-04’;
这点十分常见，心愿大家尽量改写。
b) 用了Order by
c) 有笛卡尔积
通常这些状况下，除了找方法来变通和防止，我临时没有什么好的方法，因为这些操作都是全局的，所以hadoop不得不用一个reduce去实现；
同样的，在设置reduce个数的时候也须要思考这两个准则：使大数据量利用适合的reduce数；使单个reduce工作解决适合的数据量；

优化参考; https://www.cnblogs.com/wuxia…

1.11.1 mapreduce的根本流程及阶段可进行的优化操作

附上网上的一张老图（尽管老然而很具备代表性，哈哈哈），下面的图晓得，对于mapreduce工作可分为以上几个步骤，input、splitting、Mapping、Shuffing、reducing、finalresult，咱们能够对这些阶段进行别离的优化！！

Splitting阶段优化：将输出数据进行依照大小节分，分成不同的块

           优化点：1.适当减少切割块的大小（单个节点先依照maxsize进行切分，剩下的进行minsize大小的合并。而后是节点之间的合并，最初是机架之间的合并。）

                     set mapreduce.input.fileinputformat.split.minsize = 1024000000;
                     set mapreduce.input.fileinputformat.split.maxsize = 1024000000;（默认256M）
                     set mapreduce.input.fileinputformat.split.minsize.per.node= 1024000000;
                     set mapreduce.input.fileinputformat.split.maxsize.per.node= 1024000000;（默认1b）
                     set mapreduce.input.fileinputformat.split.minsize.per.rack= 1024000000; 
                     set mapreduce.input.fileinputformat.split.maxsize.per.rack= 1024000000;（默认1b）
                     set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;（默认存在）

Mapping阶段优化：将雷同的key进行对应的转换，其实map的个数取决于上一阶段切割的最终的数据块个数

          优化点：1.手动规定map的个数（例子：工作中有100个map，然而能够使map分批执行一批10个）

                     set mapreduce.job.running.map.limit=20;

          2.规定map同时并行的个数（例子：工作中有100个map，然而能够使map分批执行一批10个）

                     set mapreduce.map.memory.mb=3584;(3.5G内存)  默认2.1倍会杀掉

          3.限度map可应用的最大内存

                     set hive.map.aggr = true
                     set hive.groupby.mapaggr.checkinterval = 100000（将数据每100000进行聚合）

Shuffing阶段优化：将雷同的的key放到一个reduce中，其实是一个网络传输的过程。

 优化点：1.map后进行压缩（压缩后主动解压）

                     set mapreduce.map.output.compress=true（map输入压缩，mapreduce参数）
                     set mapreduce.map.output.compress.codec=org.apache.hadoop.io.compress.GzipCodec（map输入压缩格局，mapreduce参数） 

          2.map后进行合并文件操作

                     set hive.merge.mapfiles = true（新启一个job实现合并，合并多大set hive.merge.size.per.task决定）

reducing阶段优化：进行数据累加操作，并将后果传输到对应的文件中。

  优化点：1.手动规定reduce的个数

                     set mapred.reduce.tasks = 20;

          2.规定reduce同时并行的个数（例子：工作中有100个reduce，然而能够使reduce分批执行一批10个）

                     set mapreduce.job.running.reduce.limit=80;

          3.限度reduce可应用的最大内存

                     set mapreduce.reduce.memory.mb=7168;(7G内存)    默认2.1倍会杀掉

          4.设置每个reduce可解决的数据大小（间接决定reduce个数）

                     set hive.exec.reducers.bytes.per.reducer=1024*1000*1000;

          5.reduce最大个数

                     set hive.exec.reducers.max =2000;(mapreduce.job.running.reduce.limit变相应用)

          6.reduce后可进行文件合并

                     set hive.merge.sparkfiles = false（spark引擎，完结后合并文件，新启动一个工作）
                     set hive.merge.tezfiles = false（tez引擎，完结后合并文件，新启动一个工作）
                     set hive.merge.mapredfiles = true（mapreduce引擎，完结后合并文件，新启动一个工作）
                     set hive.merge.smallfiles.avgsize =100*1000*1000 （输入文件小于以后值的时候，工作完结后合并文件的大小）
                     set hive.merge.size.per.task = 1024*1000*1000 （将文件合并成为多大）

finalresult阶段优化：其实就是写文件的过程。

 优化点：1.reduce后要进行压缩写到HDFS（每个节点独自跑工作，然而最初的后果须要汇聚到一处）

                     set mapreduce.output.fileoutputformat.compress=false // 默认值是 false reduce属性
                     set mapreduce.output.fileoutputformat.compress.type=BLOCK // 默认值是 Record reduce属性
                     set mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.GzipCodec // 默认值是 org.apache.hadoop.io.compress.DefaultCodec

其余优化：

          JVM优化：一个JVM运行的job数是有下限的，咱们能够设置最大执行的个数

                     set mapreduce.job.jvm.numtasks=100

          并发及并发线程数优化：job的子查问可能没关系，所以能够开启并发查问

                     set hive.exec.parallel = true; 
                     set hive.exec.parallel.thread.number=8;

          数据歪斜优化：可进行key值个数的判断，判断时候产生数据歪斜

                     set hive.optimize.skewjoin=true;
                     set hive.skewjoin.key=100000;（超过10000个雷同的key就认为是数据歪斜，须要进行打散解决）

          分区优化：hive有桶表和分区表，可开启动静分区（其实就是不同文件夹） 

                     set hive.exec.dynamic.partition=true
                     set hive.exec.dynamic.partition.mode=nonstrict（分区表分为严格模式和非严格模式）

          job之间优化：

                     set hive.exec.compress.output=true;（最终后果压缩。若map压缩和reduce压缩都没有用，改参数应用的话。两个job第一个job后数据不压缩，第二个job输入压缩）
                     set hive.exec.compress.intermediate=true（若map压缩reduce压缩最终输入压缩都没有用，改参数应用的话。两个job第一个job后数据压缩，第二个job输入不压缩）                          SQL优化：

        小表往前放

1.11.2 罕用hive参数优化

其实下面是对每一个阶段都进行数据优化，有很多参数都是默认开启或者有默认值的。

只须要用到罕用的几个就行，其余的作为理解。上面列举出比拟罕用的：

    其实下面是对每一个阶段都进行数据优化，有很多参数都是默认开启或者有默认值的。

    只须要用到罕用的几个就行，其余的作为理解。上面列举出比拟罕用的：

Splitting阶段：将输出小文件合并成为大文件
                     set mapreduce.input.fileinputformat.split.minsize = 1024000000;（参数mapreduce.map.memory.mb=3584 默认2.1倍会杀掉，一个map申请3.5G内存不用浪费了）
                     set mapreduce.input.fileinputformat.split.maxsize = 1024000000;
                     set mapreduce.input.fileinputformat.split.minsize.per.node= 1024000000;
                     set mapreduce.input.fileinputformat.split.maxsize.per.node= 1024000000;
                     set mapreduce.input.fileinputformat.split.minsize.per.rack= 1024000000; 
                     set mapreduce.input.fileinputformat.split.maxsize.per.rack= 1024000000;

map阶段个别很快，参数能够不设置
reduce阶段
                     set mapreduce.job.running.reduce.limit=80;（例子：工作中有100个reduce，然而能够使reduce分批执行一批10个）

合并文件
                     hive合并文件是新启动一个工作合并文件，感觉这个参数不太适合，有这个工夫不如间接输入（map和reduce阶段都是一样的）。

压缩文件
                     （这个参数非常好，压缩不仅仅节约空间而且在网络传输的时候比拟省宽带，mapreduce和spark都是默认能够解压缩的，比拟不便。）

                     set mapreduce.map.output.compress=true（map输入压缩，map阶段参数）
                     set mapreduce.map.output.compress.codec=org.apache.hadoop.io.compress.GzipCodec（map输入压缩格局，map阶段参数）
                     set mapreduce.output.fileoutputformat.compress=false // 默认值是 false reduce阶段参数
                     set mapreduce.output.fileoutputformat.compress.type=BLOCK // 默认值是 Record reduce阶段参数
                     set mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.GzipCodec // 默认值是 org.apache.hadoop.io.compress.DefaultCodec
                     set hive.exec.compress.output=true;（最终后果压缩。若map压缩和reduce压缩都没有用，改参数应用的话。两个job第一个job后数据不压缩，第二个job输入压缩）
                     set hive.exec.compress.intermediate=true（若map压缩reduce压缩最终输入压缩都没有用，改参数应用的话。两个job第一个job后数据压缩，第二个job输入不压缩）

Jvm优化（倡议不设置）
并发优化
                     set hive.exec.parallel = true;
                     set hive.exec.parallel.thread.number=8;

数据歪斜优化：
                     set hive.optimize.skewjoin=true;
                     set hive.skewjoin.key=100000;（超过10000个雷同的key就认为是数据歪斜，须要进行打散解决）

分区优化（建表的时候要partation by ）：
                     set hive.exec.dynamic.partition=true
                     set hive.exec.dynamic.partition.mode=nonstrict（分区表分为严格模式和非严格模式）

SQL优化
                     小表往前放

1.11.3 hivesql内存计算过程，mapreduce确定及罕用UI端口

1.hive参数配置

Hive UI——>>>>Hive Configuration页签找到（版本与版本之间参数的名字不太一样肯定要看清楚配置的名字，最好应用最新的配置名称）

2.hivesql应用内存资源

hivesql应用资源可在YARN的治理界面中RUNNING中看到，须要时不时刷新因为是动静的。（个别为3.5gmap个数字+7greduce个数）

3.hivesql生成的job的map个数reduce个数

可在YARN的治理界面中FINISHED中找到job后点击History，进去之后就会看到map数和reduce数

Splitting块个数决定map个数，reduce个数取决于输入的大小（1G一个reduce）

4.罕用UI：

1、HDFS页面：50070

2、YARN的治理界面：8088

3、HistoryServer的治理界面：19888

4、Zookeeper的服务端口号：2181

5、Hive.server2=10002

6、Kafka的服务端口号：9092

7、Hbase界面：16010,60010

8、Spark的界面：8080

1.11 数仓分层

https://www.cnblogs.com/nthfo…