Hive 实现自增序列
在利用数据仓库进行数据处理时,通常有这样一个业务场景,为一个 Hive 表新增一列自增字段(比方事实表和维度表之间的 ” 代理主键 ”)。尽管 Hive 不像 RDBMS 如 mysql 一样自身提供自增主键的性能,但它自身能够通过函数来实现自增序列性能:利用 row_number()窗口函数或者应用 UDFRowSequence。
示例:table_src 是咱们通过业务需要解决失去的两头表数据,当初咱们须要为 table_src 新增一列自增序列字段 auto_increment_id,并将最终数据保留到 table_dest 中。
1. 利用 row_number 函数
场景 1:table_dest 中目前没有数据
insert into table table_dest
select row_number() over(order by 1) as auto_increment_id, table_src.* from table_src; 场景 2: table_dest 中有数据,并且曾经通过新增自增字段解决
insert into table table_dest
select (row_number() over(order by 1) + dest.max_id) auto_increment_id, src.* from table_src src cross join (select max(auto_increment_id) max_id from table_dest) dest;
2. 利用 UDFRowSequence
首先 Hive 环境要有 hive-contrib 相干 jar 包,而后执行
create temporary function row_sequence as ‘org.apache.hadoop.hive.contrib.udf.UDFRowSequence’; 针对上述场景一,可通过以下语句实现:
insert into table table_dest
select row_sequence() auto_increment_id, table_src.* from table_src; 场景 2 实现起来也很简略,这里不再赘述。
然而,须要留神二者的区别:
row_number 函数是对整个数据集做解决,自增序列在当次排序中是间断的惟一的。
UDFRowSequence 是依照工作排序,然而一个 SQL 可能并发执行的 job 不止一个,而每个 job 都会从 1 开始各自排序,所以不能保障序号全局惟一。能够思考将 UDFRowSequence 扩大到一个第三方存储系统中,进行序号逻辑治理,来最终实现全局的间断自增惟一序号。
Hive 元数据问题
以下基于 hive-2.X 版本阐明。
Hive 失常启动,然而执行 show databases 时报以下谬误:
SemanticException org.apache.hadoop.hive.ql.metadata.HiveException:
java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient 首先从异样信息剖析可知,是元数据问题导致的异样。
Hive 默认将元数据存储在 derby,但因为用 derby 作为元数据存储服务弊病太多,咱们通常会抉择将 Hive 的元数据存在 mysql 中。所以咱们要确保 hive-site.xml 中 mysql 的信息要配置正确,Hive 要有 mysql 的相干连贯驱动 jar 包,并且有 mysql 的权限。
首先在 hive-site.xml 中配置 mysql 信息:
<configuration>
<property>
<!-- mysql 中存储 Hive 元数据的库 -->
<name>javax.jdo.option.ConnectionURL</name>
<value>jdbc:mysql://localhost:3306/hive_metadata?createDatabaseIfNotExist=true</value>
<description>JDBC connect string for a JDBC metastore</description>
</property>
<property>
<name>javax.jdo.option.ConnectionDriverName</name>
<value>com.mysql.jdbc.Driver</value>
<description>Driver class name for a JDBC metastore</description>
</property>
<property>
<!-- mysql 用户名 -->
<name>javax.jdo.option.ConnectionUserName</name>
<value>root</value>
<description>username to use against metastore database</description>
</property>
<property>
<!-- mysql 明码 -->
<name>javax.jdo.option.ConnectionPassword</name>
<value>root</value>
<description>password to use against metastore database</description>
</property>
</configuration> 执行完上述操作后,如果配置了 Hive metastore 形式,还须要启动该服务:
nohup hive –service metastore & 如果想启动 hiveserver2,则执行:
nohup hive –service hiveserver2 & 然而,此时可能因为你设置的 mysql 元数据存储库没有进行 schema 初始化,会报相似以下异样:
— 异样 1
Exception in thread “main” MetaException(message:Version information not found in metastore.)
at org.apache.hadoop.hive.metastore.RetryingHMSHandler.<init>(RetryingHMSHandler.java:83)
at org.apache.hadoop.hive.metastore.RetryingHMSHandler.getProxy(RetryingHMSHandler.java:92)
at org.apache.hadoop.hive.metastore.HiveMetaStore.newRetryingHMSHandler(HiveMetaStore.java:6896)
at org.apache.hadoop.hive.metastore.HiveMetaStore.newRetryingHMSHandler(HiveMetaStore.java:6891)
at org.apache.hadoop.hive.metastore.HiveMetaStore.startMetaStore(HiveMetaStore.java:7149)
at org.apache.hadoop.hive.metastore.HiveMetaStore.main(HiveMetaStore.java:7076)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.hadoop.util.RunJar.run(RunJar.java:226)
at org.apache.hadoop.util.RunJar.main(RunJar.java:141)
Caused by: MetaException(message:Version information not found in metastore.)– 异样 2
MetaException(message:Required table missing : “DBS
” in Catalog “” Schema “”. DataNucleus requires this table to perform its persistence operations. Either your MetaData is incorrect, or you need to enable “datanucleus.schema.autoCreateTables”)
at org.apache.hadoop.hive.metastore.RetryingHMSHandler.<init>(RetryingHMSHandler.java:83)
at org.apache.hadoop.hive.metastore.RetryingHMSHandler.getProxy(RetryingHMSHandler.java:92)
at org.apache.hadoop.hive.metastore.HiveMetaStore.newRetryingHMSHandler(HiveMetaStore.java:6896)
at org.apache.hadoop.hive.metastore.HiveMetaStore.newRetryingHMSHandler(HiveMetaStore.java:6891)
at org.apache.hadoop.hive.metastore.HiveMetaStore.startMetaStore(HiveMetaStore.java:7149)
at org.apache.hadoop.hive.metastore.HiveMetaStore.main(HiveMetaStore.java:7076)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.hadoop.util.RunJar.run(RunJar.java:226)
at org.apache.hadoop.util.RunJar.main(RunJar.java:141)此时,还须要在 hive-site.xml 中配置以下信息:
<property>
<name>hive.metastore.schema.verification</name>
<value>false</value>
<description> Enforce metastore schema version consistency. True: Verify that version information stored in is compatible with one from Hive jars. Also disable automatic schema migration attempt. Users are required to manually migrate schema after Hive upgrade which ensures proper metastore schema migration. (Default); False: Warn if the version information stored in metastore doesn’t match with one from in Hive jars.
</description>
</property>
<property>
<name>datanucleus.schema.autoCreateAll</name>
<value>true</value>
</property> 并执行 schematool -initSchema -dbType mysql 进行 Hive 元数据的初始化。呈现以下信息则阐明初始化结束,能够到 mysql 中元数据库看到初始化生成的表。
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/usr/local/soft/apache-hive-2.3.7-bin/lib/log4j-slf4j-impl-2.6.2.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/local/soft/hadoop-2.7.7/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.ht… for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
Metastore connection URL: jdbc:mysql://localhost:3306/hive_metadata?createDatabaseIfNotExist=true
Metastore Connection Driver : com.mysql.jdbc.Driver
Metastore connection User: root
Starting metastore schema initialization to 2.3.0
Initialization script hive-schema-2.3.0.mysql.sql
Initialization script completed
schemaTool completed 最初,重新启动 bin/hive,执行 show databases、建表、查问等 SQL 语句进行测试,都能失常执行。