Setting up Hadoop and a first small Hadoop project: word count

Setting up Hadoop

I set up Hadoop on Windows 10.

References:

1. Detailed Hadoop installation and configuration

2. winutils download

3. Hadoop 3.0.3 download

4. Hadoop startup error: java.lang.NoClassDefFoundError:/org/apache/hadoop/yarn/server/timelineCollectorManager

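A first small Hadoop project: word count

Word count is probably the first small project for many people getting started with Hadoop. The reference I followed is the book "MapReduce Design Patterns". Running this small example does not require starting Hadoop (the job runs locally from the IDE).

Pitfalls I ran into: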
(1)Caused by: java.io.FileNotFoundException: HADOOP_HOME and hadoop.home.dir ar

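My fix was to add HADOOP_HOME as a system environment variable, add its bin directory to the system PATH, and restart IDEA. I had previously added it as a user variable, but for some reason that did not take effect, so I set it at the system level instead. The following code verifies the setting: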

System.out.println(System.getenv("HADOOP_HOME"));
System.out.println(System.getenv("PATH"));

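If you get an error saying winutils.exe cannot be found, download the winutils package and replace Hadoop's bin folder with the bin folder for the matching version. The GitHub download location is listed in the references of the "Setting up Hadoop" section.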
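
The two checks above (is the Hadoop home set, and is winutils.exe in its bin folder) can be combined into one small program. This is only a sketch: the class HadoopHomeCheck and its methods are hypothetical names, not part of Hadoop. It assumes Hadoop looks at the hadoop.home.dir system property before the HADOOP_HOME environment variable, which is the order hinted at by the error message.

```java
import java.io.File;

// Hypothetical helper for illustration; not part of Hadoop.
final class HadoopHomeCheck {

    // Assumed lookup order: the hadoop.home.dir system property first,
    // then the HADOOP_HOME environment variable.
    static String resolveHadoopHome() {
        String home = System.getProperty("hadoop.home.dir");
        if (home == null || home.isEmpty()) {
            home = System.getenv("HADOOP_HOME");
        }
        return home;
    }

    // True only if <home>/bin/winutils.exe exists as a regular file.
    static boolean hasWinutils(String home) {
        if (home == null || home.isEmpty()) {
            return false;
        }
        return new File(new File(home, "bin"), "winutils.exe").isFile();
    }

    public static void main(String[] args) {
        String home = resolveHadoopHome();
        System.out.println("Hadoop home: " + home);
        System.out.println("winutils.exe found: " + hasWinutils(home));
    }
}
```

If the first line prints null, the variable is not visible to the JVM started by the IDE, which matches the symptom I saw before restarting IDEA.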

(2) A Maven dependency problem.

Exception in thread "main" java.lang.VerifyError: Bad return type
'org/apache/hadoop/mapred/JobStatus' (current frame, stack[0]) is not assign 'org/apache/hadoop/mapreduce/JobStatus'

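I could not find a solution for this online. However, my program follows "MapReduce Design Patterns", so once I was fairly sure the code itself was not at fault, the Maven dependencies were the only remaining suspect. After the change, my project's dependencies are: hadoop-common, hadoop-hdfs, hadoop-mapreduce-client-core, hadoop-mapreduce-client-jobclient, and hadoop-mapreduce-client-common. All are version 3.0.3, because the Hadoop I installed is 3.0.3.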
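
For reference, the dependency list above would look roughly like this in pom.xml. This is a sketch, not a copy of my actual file: the hadoop.version property is a name introduced here for convenience, and the groupId org.apache.hadoop is the one these artifacts are published under.

```xml
<properties>
  <!-- Keep every Hadoop artifact on the same version as the installed Hadoop -->
  <hadoop.version>3.0.3</hadoop.version>
</properties>

<dependencies>
  <dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-common</artifactId>
    <version>${hadoop.version}</version>
  </dependency>
  <dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-hdfs</artifactId>
    <version>${hadoop.version}</version>
  </dependency>
  <dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-mapreduce-client-core</artifactId>
    <version>${hadoop.version}</version>
  </dependency>
  <dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-mapreduce-client-jobclient</artifactId>
    <version>${hadoop.version}</version>
  </dependency>
  <dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-mapreduce-client-common</artifactId>
    <version>${hadoop.version}</version>
  </dependency>
</dependencies>
```

Using a single version property makes it harder to end up with mixed Hadoop versions on the classpath, which is one plausible cause of a VerifyError like the one above.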

(3) Another Maven dependency problem.

java.io.IOException: Cannot initialize Cluster. Please check your configuration for mapreduce.framework.name and the correspond server addresses.

Adding the two dependencies hadoop-mapreduce-client-jobclient and hadoop-mapreduce-client-common fixes this.
Reference: https://blog.csdn.net/qq_2012…

Complete code

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;

import org.apache.hadoop.fs.Path;

import java.io.IOException;
import java.util.StringTokenizer;

/**
 * @Author liuffei
 * @Date 2019/7/13 9:41
 * @Description
 */
public class CommentWordCount {

    //Mapper<Object, Text, Text, IntWritable>: input key, input value, output key, output value
    //The mapper's input key and value are defined by the FileInputFormat configured for the job.
    public static class WordCountMapper extends Mapper<Object, Text,Text, IntWritable> {
        //Constant count of 1, emitted for every word
        IntWritable one = new IntWritable(1);
        Text word = new Text();

        //Overrides the map method of the Mapper class
        public void map(Object key, Text value, Context context) throws IOException, InterruptedException{
            String txt = value.toString();
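            //Replace every non-letter character with a space, so that
            //several words on one line stay separate for the tokenizer
            txt = txt.replaceAll("[^a-zA-Z]"," ");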
            StringTokenizer stringTokenizer = new StringTokenizer(txt);
            while(stringTokenizer.hasMoreTokens()) {
                word.set(stringTokenizer.nextToken());
                //Emit each word with a count of 1.
                context.write(word, one);
            }
        }
    }

    //Reducer<Text, IntWritable, Text, IntWritable>: input key, input value, output key, output value
    //The reducer's input key and value types must match the mapper's output key and value types
    public static class IntSumReducer extends Reducer<Text, IntWritable,Text, IntWritable> {

        public void reduce(Text key,Iterable<IntWritable> values,Context context) throws IOException, InterruptedException{
            int sum = 0;
            for (IntWritable val:values) {
                sum += val.get();
            }
            context.write(key,new IntWritable(sum));
        }
    }

    public static void main(String[] args){
        try {
            Configuration conf = new Configuration();
            String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
            if(otherArgs.length != 2) {
                System.err.println("need enter input and output directory path");
                System.exit(2);
            }
            Job job = Job.getInstance(conf, "Word Count");
            //Must match the name of your own job class
            job.setJarByClass(CommentWordCount.class);
            //Must match your own Mapper and Reducer classes
            job.setMapperClass(WordCountMapper.class);
            job.setCombinerClass(IntSumReducer.class);
            job.setReducerClass(IntSumReducer.class);
            //The output key and value types must match those defined by the mapper.
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            //Input and output paths
            FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
            FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
            System.exit(job.waitForCompletion(true)?0:1);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
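
To see what the job computes without putting Hadoop on the classpath, the same counting logic can be tried in plain Java. This is only an illustration: LocalWordCount and its methods are made-up names, and it folds the map and reduce steps into one in-memory loop rather than running them as separate phases. Non-letter characters are replaced with spaces here, which is a choice of this sketch.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

// Plain-Java illustration only; not a Hadoop job.
final class LocalWordCount {

    // "Map" step: split one line into words, as a mapper does per input record.
    static List<String> tokenize(String line) {
        String cleaned = line.replaceAll("[^a-zA-Z]", " ");
        StringTokenizer tokens = new StringTokenizer(cleaned);
        List<String> words = new ArrayList<>();
        while (tokens.hasMoreTokens()) {
            words.add(tokens.nextToken());
        }
        return words;
    }

    // "Reduce" step: add 1 to the running total of every word seen.
    static Map<String, Integer> count(List<String> lines) {
        Map<String, Integer> counts = new TreeMap<>(); // sorted by key for stable printing
        for (String line : lines) {
            for (String word : tokenize(line)) {
                counts.merge(word, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("hello hadoop", "hello, world!");
        System.out.println(count(lines)); // prints {hadoop=1, hello=2, world=1}
    }
}
```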

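Before running the main method, I created an input folder at the same level as my project's src directory and put two txt files in it (note that files are read line by line, so I put each word on its own line). When running main, pass the input and output paths as program arguments. You can choose any folder paths you like. The output folder does not need to be created by hand; it is created automatically. Before each run you must delete the output folder from the previous run, otherwise the job fails with an error saying the output folder already exists.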
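
Deleting the old output folder by hand gets tedious, so it can be scripted. The sketch below uses only java.nio and works on a local folder, which is what this example writes to; OutputCleaner is a hypothetical name. Inside the job itself, Hadoop's FileSystem.delete(Path, boolean) could play the same role, but I have not wired that in here.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper for illustration; works on the local file system only.
final class OutputCleaner {

    // Recursively deletes dir if it exists; returns true if something was removed.
    static boolean deleteIfExists(Path dir) {
        if (!Files.exists(dir)) {
            return false;
        }
        try {
            List<Path> paths;
            try (Stream<Path> walk = Files.walk(dir)) {
                // Reverse order so that children come before their parents.
                paths = walk.sorted(Comparator.reverseOrder()).collect(Collectors.toList());
            }
            for (Path p : paths) {
                Files.delete(p);
            }
            return true;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        if (args.length != 1) {
            System.err.println("usage: OutputCleaner <output directory>");
            System.exit(2);
        }
        System.out.println("removed: " + deleteIfExists(Paths.get(args[0])));
    }
}
```

The directory must be passed explicitly; there is deliberately no default path, so a run without arguments cannot delete anything.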

That is my summary so far. The road to learning Hadoop is long and hard, and I hope I can keep at it.
