Overview
Hive on Spark means Hive keeps the metadata and parses the SQL, while Spark's RDDs do the actual computation.
- HDFS handles storage
- Spark handles execution
- Hive handles data management
Versions
apache-hive-3.1.3-bin.tar
spark-3.3.1-bin-hadoop3
spark-3.3.1-bin-without-hadoop
Pitfalls
Since this was my first contact with big data, I ran into some basic pitfalls, listed below.
- Hive currently only supports JDK 1.8, so the Hadoop, Hive, and Spark clusters must all run JDK 1.8.
- When uploading Spark's dependency jars to the HDFS cluster, use the without-hadoop build; the matching version can be found on the Spark download site (see the upload sketch after this list).
- To work with Spark 3.x, Hive must be compiled from source yourself (see the build sketch after this list).
- Compiling under Windows is awkward because many of the sh scripts cannot be executed there directly; I set up a CentOS virtual machine and compiled with the desktop version of IDEA.
- When a Hive SQL statement fails, check the logs: open your cluster's YARN web UI and click into each Hive connection; its log link holds the corresponding output (see the yarn logs example after this list).
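A minimal upload sketch for the without-hadoop jars, assuming the /spark-jars HDFS path and the hdfs://hadoop-4:8020 namenode referenced in hive-site.xml below; adjust paths and hostnames to your cluster:

# unpack the without-hadoop build and push its jars to HDFS
tar -zxvf spark-3.3.1-bin-without-hadoop.tgz
hdfs dfs -mkdir -p /spark-jars
hdfs dfs -put spark-3.3.1-bin-without-hadoop/jars/* /spark-jars/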
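A build sketch for recompiling Hive against Spark 3.x, assuming you first change spark.version to 3.3.1 in Hive's root pom.xml; the flags below are the standard ones from Hive's build docs, verify them against the source tree you are using:

# after editing spark.version in pom.xml
mvn clean package -Pdist -DskipTests -Dmaven.javadoc.skip=true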
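Besides the web UI, the same logs can be pulled with the stock YARN CLI; the application ID below is a placeholder, copy the real one from the YARN UI or the list command:

# find the Hive-on-Spark application
yarn application -list -appStates ALL
# dump its logs
yarn logs -applicationId application_1234567890123_0001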
Configuration files
hive-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:mysql://192.168.0.121:3306/metastore?useSSL=false</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>com.mysql.cj.jdbc.Driver</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionUserName</name>
    <value>root</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionPassword</name>
    <value>root</value>
  </property>
  <property>
    <name>hive.metastore.warehouse.dir</name>
    <value>/user/hive/warehouse</value>
  </property>
  <property>
    <name>hive.metastore.schema.verification</name>
    <value>false</value>
  </property>
  <property>
    <name>hive.server2.thrift.port</name>
    <value>10000</value>
  </property>
  <property>
    <name>hive.server2.thrift.bind.host</name>
    <value>hadoop-4</value>
  </property>
  <property>
    <name>hive.metastore.event.db.notification.api.auth</name>
    <value>false</value>
  </property>
  <property>
    <name>hive.cli.print.header</name>
    <value>true</value>
  </property>
  <property>
    <name>hive.cli.print.current.db</name>
    <value>true</value>
  </property>
  <property>
    <name>spark.yarn.jars</name>
    <value>hdfs://hadoop-4:8020/spark-jars/*</value>
  </property>
  <property>
    <name>hive.execution.engine</name>
    <value>spark</value>
  </property>
  <property>
    <name>hive.spark.client.connect.timeout</name>
    <value>100000ms</value>
  </property>
  <property>
    <name>hive.auto.convert.join</name>
    <value>false</value>
    <description>Enables the optimization about converting common join into mapjoin</description>
  </property>
</configuration>
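With this config in place, the MySQL metastore schema needs a one-time initialization. A sketch, assuming the metastore database already exists on 192.168.0.121 and Hive's bin directory is on the PATH:

# create the metastore tables in MySQL (run once)
schematool -dbType mysql -initSchema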
spark-defaults.conf
spark.master                 yarn
spark.eventLog.enabled       true
spark.eventLog.dir           hdfs://hadoop-4:8020/spark-history
spark.executor.memory        1g
spark.driver.memory          1g
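spark.eventLog.dir points at an HDFS directory that must exist before jobs run; a minimal sketch, assuming the same namenode as in hive-site.xml:

hdfs dfs -mkdir -p /spark-history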
spark-env.sh
export SPARK_DIST_CLASSPATH=$(hadoop classpath)
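The without-hadoop build ships no Hadoop classes, so this line splices the local Hadoop installation's jars onto Spark's classpath at startup. To sanity-check what will be added, print it directly:

hadoop classpath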
Hive/Spark version incompatibility
https://blog.csdn.net/lilyjok...
https://cxyzjd.com/article/we...