Setting Up a Big Data Environment, and the Pitfalls of Running WordCount
[TOC]
Background
Recently (December 20, 2020) I have been getting to know big data architectures and the related technology stack.
Strictly speaking, "getting to know" does not require building an environment by hand and running an actual job.
But that is not how technology is learned: it takes patient, unglamorous work, accumulated bit by bit, so the hands-on part cannot be skipped.
So I started from building an environment (based on Docker) and went all the way to successfully running a WordCount job scheduled by YARN.
Along the way I hit quite a few pitfalls; filling them in one by one took roughly 10 hours.
I hope sharing these hard-won lessons helps whoever needs them get through the whole process in less time.
Note: my local environment is macOS Big Sur.
Setting up the big data environment with Docker Compose
Following the article "docker-hadoop-spark-hive 快速构建你的大数据环境", I set up a big data environment and adjusted some of its parameters so that it works on macOS.
The setup consists mainly of the following five files:
.
├── copy-jar.sh        # Spark-on-YARN support
├── docker-compose.yml # docker compose file
├── hadoop-hive.env    # environment variable configuration
├── run.sh             # startup script
└── stop.sh            # stop script
Note: one pitfall of Docker on macOS is that containers cannot be reached directly from the host. I solved this with "method four" from the article "Docker for Mac 的网络问题及解决方法(新增方法四)".
Note: you also need to map the IP addresses of the relevant Docker containers to their hostnames on the host (e.g. in /etc/hosts). Only then will the job run successfully, and only then will redirects between the web UIs work when the services are accessed from the host. This pit is deep, tread carefully. A quick way to verify the mapping is sketched after the entries below.
# switch_local
172.21.0.3 namenode
172.21.0.8 resourcemanager
172.21.0.9 nodemanager
172.21.0.10 historyserver
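As a quick sanity check (a sketch; it assumes the /etc/hosts entries above, the routing fix from "method four", and a running stack from the next section), the containers should now be reachable from the host by hostname:

```bash
# Hostnames should resolve to the container IPs configured above
ping -c 1 namenode
ping -c 1 resourcemanager

# The NameNode web UI should answer on its container port as well
curl -sI http://namenode:50070 | head -n 1
```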
docker-compose.yml
version: '2'
services:
  namenode:
    image: bde2020/hadoop-namenode:1.1.0-hadoop2.8-java8
    container_name: namenode
    volumes:
      - ~/data/namenode:/hadoop/dfs/name
    environment:
      - CLUSTER_NAME=test
    env_file:
      - ./hadoop-hive.env
    ports:
      - 50070:50070
      - 8020:8020
  resourcemanager:
    image: bde2020/hadoop-resourcemanager:1.1.0-hadoop2.8-java8
    container_name: resourcemanager
    environment:
      - CLUSTER_NAME=test
    env_file:
      - ./hadoop-hive.env
    ports:
      - 8088:8088
  historyserver:
    image: bde2020/hadoop-historyserver:1.1.0-hadoop2.8-java8
    container_name: historyserver
    environment:
      - CLUSTER_NAME=test
    env_file:
      - ./hadoop-hive.env
    ports:
      - 8188:8188
  datanode:
    image: bde2020/hadoop-datanode:1.1.0-hadoop2.8-java8
    depends_on:
      - namenode
    volumes:
      - ~/data/datanode:/hadoop/dfs/data
    env_file:
      - ./hadoop-hive.env
    ports:
      - 50075:50075
  datanode2:
    image: bde2020/hadoop-datanode:1.1.0-hadoop2.8-java8
    depends_on:
      - namenode
    volumes:
      - ~/data/datanode2:/hadoop/dfs/data
    env_file:
      - ./hadoop-hive.env
    ports:
      - 50076:50075
  datanode3:
    image: bde2020/hadoop-datanode:1.1.0-hadoop2.8-java8
    depends_on:
      - namenode
    volumes:
      - ~/data/datanode3:/hadoop/dfs/data
    env_file:
      - ./hadoop-hive.env
    ports:
      - 50077:50075
  nodemanager:
    image: bde2020/hadoop-nodemanager:1.1.0-hadoop2.8-java8
    container_name: nodemanager
    hostname: nodemanager
    environment:
      - CLUSTER_NAME=test
    env_file:
      - ./hadoop-hive.env
    ports:
      - 8042:8042
  hive-server:
    image: bde2020/hive:2.1.0-postgresql-metastore
    container_name: hive-server
    env_file:
      - ./hadoop-hive.env
    environment:
      - "HIVE_CORE_CONF_javax_jdo_option_ConnectionURL=jdbc:postgresql://hive-metastore/metastore"
    ports:
      - "10000:10000"
  hive-metastore:
    image: bde2020/hive:2.1.0-postgresql-metastore
    container_name: hive-metastore
    env_file:
      - ./hadoop-hive.env
    command: /opt/hive/bin/hive --service metastore
    ports:
      - 9083:9083
  hive-metastore-postgresql:
    image: bde2020/hive-metastore-postgresql:2.1.0
    ports:
      - 5432:5432
    volumes:
      - ~/data/postgresql/:/var/lib/postgresql/data
  spark-master:
    image: bde2020/spark-master:2.1.0-hadoop2.8-hive-java8
    container_name: spark-master
    hostname: spark-master
    volumes:
      - ./copy-jar.sh:/copy-jar.sh
    ports:
      - 18080:8080
      - 7077:7077
    env_file:
      - ./hadoop-hive.env
  spark-worker:
    image: bde2020/spark-worker:2.1.0-hadoop2.8-hive-java8
    depends_on:
      - spark-master
    environment:
      - SPARK_MASTER=spark://spark-master:7077
    ports:
      - "18081:8081"
    env_file:
      - ./hadoop-hive.env
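Once the stack is up (via run.sh further below), it is worth checking that every container actually stays running; a couple of generic Docker Compose commands are enough for that (run from the directory that contains docker-compose.yml):

```bash
# List the services with their state and published ports
docker-compose -f docker-compose.yml ps

# Tail the logs of a single service if something looks wrong
docker-compose -f docker-compose.yml logs -f namenode
```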
hadoop-hive.env
HIVE_SITE_CONF_javax_jdo_option_ConnectionURL=jdbc:postgresql://hive-metastore-postgresql/metastore
HIVE_SITE_CONF_javax_jdo_option_ConnectionDriverName=org.postgresql.Driver
HIVE_SITE_CONF_javax_jdo_option_ConnectionUserName=hive
HIVE_SITE_CONF_javax_jdo_option_ConnectionPassword=hive
HIVE_SITE_CONF_datanucleus_autoCreateSchema=false
HIVE_SITE_CONF_hive_metastore_uris=thrift://hive-metastore:9083
HIVE_SITE_CONF_hive_metastore_warehouse_dir=hdfs://namenode:8020/user/hive/warehouse
CORE_CONF_fs_defaultFS=hdfs://namenode:8020
CORE_CONF_fs_default_name=hdfs://namenode:8020
CORE_CONF_hadoop_http_staticuser_user=root
CORE_CONF_hadoop_proxyuser_hue_hosts=*
CORE_CONF_hadoop_proxyuser_hue_groups=*
HDFS_CONF_dfs_webhdfs_enabled=true
HDFS_CONF_dfs_permissions_enabled=false
YARN_CONF_yarn_log___aggregation___enable=true
YARN_CONF_yarn_resourcemanager_recovery_enabled=true
YARN_CONF_yarn_resourcemanager_store_class=org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore
YARN_CONF_yarn_resourcemanager_fs_state___store_uri=/rmstate
YARN_CONF_yarn_nodemanager_remote___app___log___dir=/app-logs
YARN_CONF_yarn_log_server_url=http://historyserver:8188/applicationhistory/logs/
YARN_CONF_yarn_timeline___service_enabled=true
YARN_CONF_yarn_timeline___service_generic___application___history_enabled=true
YARN_CONF_yarn_resourcemanager_system___metrics___publisher_enabled=true
YARN_CONF_yarn_resourcemanager_hostname=resourcemanager
YARN_CONF_yarn_timeline___service_hostname=historyserver
YARN_CONF_yarn_resourcemanager_address=resourcemanager:8032
YARN_CONF_yarn_resourcemanager_scheduler_address=resourcemanager:8030
YARN_CONF_yarn_resourcemanager_resource__tracker_address=resourcemanager:8031
YARN_CONF_yarn_nodemanager_aux___services=mapreduce_shuffle
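The variable names follow the convention of the bde2020 images: the prefix (CORE_CONF_, HDFS_CONF_, YARN_CONF_, HIVE_SITE_CONF_) selects the target *-site.xml file, and the rest is the property name, with `___` standing for `-`, `__` for `_`, and `_` for `.`. The sketch below is my reading of that mapping, not the images' exact entrypoint code:

```bash
# Approximate the env-var -> Hadoop property-name mapping used by the images:
# "___" -> "-", "__" -> "_", "_" -> "."
to_property() {
  echo "$1" | sed -e 's/___/-/g' -e 's/__/_/g' -e 's/_/./g'
}

to_property "yarn_log___aggregation___enable"   # yarn.log-aggregation-enable
to_property "yarn_nodemanager_aux___services"   # yarn.nodemanager.aux-services
```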
run.sh
#!/bin/bash
# Start the containers
docker-compose -f docker-compose.yml up -d namenode hive-metastore-postgresql
docker-compose -f docker-compose.yml up -d datanode datanode2 datanode3 hive-metastore
docker-compose -f docker-compose.yml up -d resourcemanager
docker-compose -f docker-compose.yml up -d nodemanager
docker-compose -f docker-compose.yml up -d historyserver
sleep 5
docker-compose -f docker-compose.yml up -d hive-server
docker-compose -f docker-compose.yml up -d spark-master spark-worker
# Get the host IP address and print the web UI URLs to the console
my_ip=`ifconfig | grep 'inet.*netmask.*broadcast' | awk '{print $2;exit}'`
echo "Namenode: http://${my_ip}:50070"
echo "Datanode: http://${my_ip}:50075"
echo "Spark-master: http://${my_ip}:18080"
# Run copy-jar.sh inside spark-master for Spark-on-YARN support
docker-compose exec spark-master bash -c "./copy-jar.sh && exit"
copy-jar.sh
#!/bin/bash
# Copy the Jersey 1.9 jars shipped with Hadoop into Spark's jars directory and
# remove Spark's bundled jersey-client-2.22.2.jar, so Spark on YARN works in this setup
cd /opt/hadoop-2.8.0/share/hadoop/yarn/lib/ && cp jersey-core-1.9.jar jersey-client-1.9.jar /spark/jars/ && rm -rf /spark/jars/jersey-client-2.22.2.jar
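In my understanding this swap works around the well-known Jersey version clash between Spark 2.x (which bundles Jersey 2.22.2) and the Hadoop 2.8 YARN timeline client (which needs Jersey 1.9); without it, Spark-on-YARN submissions tend to fail with NoClassDefFoundError on com.sun.jersey classes. You can confirm the swap took effect with a quick look inside the container (a sketch, using the service name from docker-compose.yml above):

```bash
# jersey-core-1.9.jar and jersey-client-1.9.jar should be listed,
# jersey-client-2.22.2.jar should be gone
docker-compose exec spark-master ls /spark/jars | grep -i jersey
```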
stop.sh
#!/bin/bash
docker-compose stop
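For completeness, a typical session with these scripts looks like this (assuming all five files sit in the same directory):

```bash
chmod +x run.sh stop.sh copy-jar.sh   # once, after creating the files
./run.sh                              # start the whole stack and print the web UI URLs
./stop.sh                             # stop all containers when you are done
```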
Submitting a MapReduce job to YARN from IDEA
References
- IDEA 向 hadoop 集群提交 MapReduce 作业
- java 操作 hadoop hdfs,实现文件上传下载 demo
- IDEA 远程提交 mapreduce 任务至 linux,遇到 ClassNotFoundException: Mapper
Note: when submitting to YARN, the code must be packaged into a jar first, otherwise the job fails with a ClassNotFoundException. See "IDEA 远程提交 mapreduce 任务至 linux,遇到 ClassNotFoundException: Mapper" for details.
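The jar referenced in WordCountRunner below (`mapred.jar`) is built as an IDEA artifact. If you prefer the command line, the same project can be packaged with Maven against the pom.xml that follows; this is only a sketch, and the jar name is simply the Maven default for this artifactId/version:

```bash
# Build the project jar; with the pom.xml below it ends up under target/
mvn clean package -DskipTests
ls target/hadoop-test-1.0.0.jar

# Point conf.set("mapred.jar", ...) in WordCountRunner at this path instead of the IDEA artifact
```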
pom.xml
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>

    <groupId>com.switchvov</groupId>
    <artifactId>hadoop-test</artifactId>
    <version>1.0.0</version>
    <name>hadoop-test</name>

    <properties>
        <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
        <maven.compiler.source>1.8</maven.compiler.source>
        <maven.compiler.target>1.8</maven.compiler.target>
    </properties>

    <dependencies>
        <dependency>
            <groupId>junit</groupId>
            <artifactId>junit</artifactId>
            <version>4.12</version>
            <scope>test</scope>
        </dependency>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-client</artifactId>
            <version>2.8.0</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-common</artifactId>
            <version>2.8.0</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-hdfs</artifactId>
            <version>2.8.0</version>
        </dependency>
    </dependencies>
</project>
log4j.properties
log4j.rootLogger=INFO, console
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.Target=System.out
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=[%p] %d{yyyy-MM-dd HH:mm:ss,SSS} method:%l%m%n
words.txt
this is a tests
this is a tests
this is a tests
this is a tests
this is a tests
this is a tests
this is a tests
this is a tests
this is a tests
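In the code below the upload is done with HdfsDemo.uploadFileToHDFS (currently commented out in main). As an alternative sketch, the same file can be pushed into HDFS through the namenode container with the standard HDFS CLI; the /share target directory matches the paths used in the Java code:

```bash
# Copy the local file into the namenode container, then put it into HDFS
docker cp words.txt namenode:/tmp/words.txt
docker exec namenode hdfs dfs -mkdir -p /share
docker exec namenode hdfs dfs -put -f /tmp/words.txt /share/words.txt
docker exec namenode hdfs dfs -ls /share
```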
HdfsDemo.java
package com.switchvov.hadoop.hdfs;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

import java.io.InputStream;

/**
 * @author switch
 * @since 2020/12/18
 */
public class HdfsDemo {
    /**
     * Configuration for the Hadoop FileSystem
     */
    private static final Configuration CONFIGURATION = new Configuration();

    static {
        // Address of the HDFS namenode
        CONFIGURATION.set("fs.default.name", "hdfs://namenode:8020");
    }

    /**
     * Upload a local file (filePath) to the given path (dst) on the HDFS cluster
     */
    public static void uploadFileToHDFS(String filePath, String dst) throws Exception {
        // Create a FileSystem handle
        FileSystem fs = FileSystem.get(CONFIGURATION);
        Path srcPath = new Path(filePath);
        Path dstPath = new Path(dst);
        long start = System.currentTimeMillis();
        fs.copyFromLocalFile(false, srcPath, dstPath);
        System.out.println("Time:" + (System.currentTimeMillis() - start));
        System.out.println("________uploaded file to " + CONFIGURATION.get("fs.default.name") + "____________");
        fs.close();
    }

    /**
     * Download a file and print it to standard output
     */
    public static void downLoadFileFromHDFS(String src) throws Exception {
        FileSystem fs = FileSystem.get(CONFIGURATION);
        Path srcPath = new Path(src);
        InputStream in = fs.open(srcPath);
        try {
            // Copy the file to standard output (i.e. the console)
            IOUtils.copyBytes(in, System.out, 4096, false);
        } finally {
            IOUtils.closeStream(in);
            fs.close();
        }
    }

    public static void main(String[] args) throws Exception {
        String filename = "words.txt";
        // uploadFileToHDFS(
        //         "/Users/switch/projects/OtherProjects/bigdata-enviroment/hadoop-test/data/" + filename,
        //         "/share/" + filename
        // );
        downLoadFileFromHDFS("/share/output12/" + filename + "/part-r-00000");
    }
}
WordCountRunner.java
package com.switchvov.hadoop.mapreduce.wordcount;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

import java.io.IOException;

/**
 * @author switch
 * @since 2020/12/17
 */
public class WordCountRunner {
    /**
     * LongWritable: input key type (line offset)
     * Text:         input value type
     * Text:         output key type
     * IntWritable:  output value type
     *
     * @author switch
     * @since 2020/12/17
     */
    public static class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        /**
         * @param key     line offset
         * @param value   content of one line, e.g. "this is a tests"
         * @param context output
         * @throws IOException          exception
         * @throws InterruptedException exception
         */
        @Override
        protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
            String line = value.toString();
            // Split the line on spaces to get the words
            String[] words = line.split(" ");
            for (String word : words) {
                context.write(new Text(word), new IntWritable(1));
            }
        }
    }

    /**
     * Text:         input key type
     * IntWritable:  input value type
     * Text:         output key type
     * IntWritable:  output value type
     *
     * @author switch
     * @since 2020/12/17
     */
    public static class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        /**
         * @param key     key emitted by the map phase
         * @param values  values emitted by the map phase for this key
         * @param context output
         * @throws IOException          exception
         * @throws InterruptedException exception
         */
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
            int count = 0;
            for (IntWritable value : values) {
                count += value.get();
            }
            context.write(key, new IntWritable(count));
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Cross-platform submission, so the MR job can also be submitted from Windows
        conf.set("mapreduce.app-submission.cross-platform", "true");
        // Use YARN as the scheduler
        conf.set("mapreduce.framework.name", "yarn");
        // Hostname of the resourcemanager
        conf.set("yarn.resourcemanager.hostname", "resourcemanager");
        // Default namenode address
        conf.set("fs.defaultFS", "hdfs://namenode:8020");
        conf.set("fs.default.name", "hdfs://namenode:8020");
        // Point at the jar containing this code, otherwise a ClassNotFound exception is thrown,
        // see: https://blog.csdn.net/qq_19648191/article/details/56684268
        conf.set("mapred.jar", "/Users/switch/projects/OtherProjects/bigdata-enviroment/hadoop-test/out/artifacts/hadoop/hadoop.jar");
        // Job name
        Job job = Job.getInstance(conf, "word count");
        // Main class
        job.setJarByClass(WordCountRunner.class);
        // Mapper class
        job.setMapperClass(WordCountMapper.class);
        // Combiner class; here its logic is the same as the reducer's
        job.setCombinerClass(WordCountReducer.class);
        // Reducer class
        job.setReducerClass(WordCountReducer.class);
        // Output key type
        job.setOutputKeyClass(Text.class);
        // Output value type
        job.setOutputValueClass(IntWritable.class);
        // Number of reduce tasks (default is 1)
        job.setNumReduceTasks(1);
        // Mapper<Object, Text, Text, IntWritable>: the output types must match the last two type parameters of the superclass
        String filename = "words.txt";
        String args0 = "hdfs://namenode:8020/share/" + filename;
        String args1 = "hdfs://namenode:8020/share/output12/" + filename;
        // Input path
        FileInputFormat.addInputPath(job, new Path(args0));
        // Output path
        FileOutputFormat.setOutputPath(job, new Path(args1));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
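Once WordCountRunner finishes, the result sits in /share/output12/words.txt/part-r-00000 on HDFS (the output path hard-coded above); HdfsDemo.downLoadFileFromHDFS prints it, or you can inspect the job and its output from inside the containers. Since words.txt contains the same four words on nine lines, each word should be counted nine times:

```bash
# List finished YARN applications from the resourcemanager container
docker exec resourcemanager yarn application -list -appStates FINISHED

# Read the WordCount output directly from HDFS
docker exec namenode hdfs dfs -cat /share/output12/words.txt/part-r-00000
# Expected (tab-separated):
# a       9
# is      9
# tests   9
# this    9
```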
Sharing and recording what I learn and see.