MapReduce程序——WordCount（Windows_Eclipse + Ubuntu14.04_Hadoop2.9.0）

本文主要参考《Hadoop应用开发技术详解（作者：刘刚）》

一、工作环境

Windows7： Eclipse + JDK1.8.0

Ubuntu14.04：Hadoop2.9.0

二、准备工作——导入JAR包

1. 建一个Hadoop专用的工作空间

2. 在工作空间的目录下建一个专门用来存放开发MapReduce程序所需的Hadoop依赖的JAR包的文件夹

所需的JAR包在Ubuntu中$HADOOP_HOME/share/hadoop下，将JAR包复制到刚刚建好的文件夹中

MapReduce程序——WordCount（Windows_Eclipse + Ubuntu14.04_Hadoop2.9.0）

需要的JAR包如下，可能有部分重复：

$HADOOP_HOME/share/hadoop/common & $HADOOP_HOME/share/hadoop/common/lib

MapReduce程序——WordCount（Windows_Eclipse + Ubuntu14.04_Hadoop2.9.0）

$HADOOP_HOME/share/hadoop/hdfs & $HADOOP_HOME/share/hadoop/hdfs/lib

MapReduce程序——WordCount（Windows_Eclipse + Ubuntu14.04_Hadoop2.9.0）

$HADOOP_HOME/share/hadoop/httpfs/tomcat/lib

MapReduce程序——WordCount（Windows_Eclipse + Ubuntu14.04_Hadoop2.9.0）

$HADOOP_HOME/share/hadoop/kms/tomcat/lib

MapReduce程序——WordCount（Windows_Eclipse + Ubuntu14.04_Hadoop2.9.0）

$HADOOP_HOME/share/hadoop/mapreduce & $HADOOP_HOME/share/hadoop/mapreduce/lib

MapReduce程序——WordCount（Windows_Eclipse + Ubuntu14.04_Hadoop2.9.0）

$HADOOP_HOME/share/hadoop/tools/lib

MapReduce程序——WordCount（Windows_Eclipse + Ubuntu14.04_Hadoop2.9.0）

$HADOOP_HOME/share/hadoop/yarn & $HADOOP_HOME/share/hadoop/yarn/lib

MapReduce程序——WordCount（Windows_Eclipse + Ubuntu14.04_Hadoop2.9.0）

3. 新建用户库

Windows → Preference → Java → Build Path → User Libraries → New...

看到如下界面：

MapReduce程序——WordCount（Windows_Eclipse + Ubuntu14.04_Hadoop2.9.0）

点击OK后看到如下界面：

MapReduce程序——WordCount（Windows_Eclipse + Ubuntu14.04_Hadoop2.9.0）

点击Add External JARs... → 在刚刚建好的文件夹中选中所有JAR包 → 打开 → OK

用户库创建成功！

三、创建一个Java工程

File → New → Java Project

除了红框的内容，其他选项默认

MapReduce程序——WordCount（Windows_Eclipse + Ubuntu14.04_Hadoop2.9.0）

右击项目名 → Build Path → Add Libraries... → User Library → 选中建好的用户库

MapReduce程序——WordCount（Windows_Eclipse + Ubuntu14.04_Hadoop2.9.0）

四、MapReduce代码的实现

1. WordMapper类

MapReduce程序——WordCount（Windows_Eclipse + Ubuntu14.04_Hadoop2.9.0）

package wordCount;

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// 继承Mapper接口，设置Map的输入类型为<Object, Text>，输出类型为<Text, IntWritable>
public class WordMapper extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1); // one表示单词出现一次
    private Text word = new Text(); // word用于存储切下来的词
    public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
        StringTokenizer itr = new StringTokenizer(value.toString()); // 对输入的行切词
        while (itr.hasMoreTokens()) {
            word.set(itr.nextToken()); // 切下来的单词存入word
            context.write(word, one);
        }
    }
}

2. WordReducer类

MapReduce程序——WordCount（Windows_Eclipse + Ubuntu14.04_Hadoop2.9.0）

package wordCount;

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// 继承Reducer接口，设置Reduce的输入类型为<Text, IntWritable>，输出类型为<Text, IntWritable>
public class WordReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private IntWritable result = new IntWritable(); // result记录单词的频数
    public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
        int sum = 0;
　　　　　// 对获取的<key, IntWritable>计算value的和
        for (IntWritable val : values) {
            sum += val.get();
        }
        result.set(sum); // 将频数设置到result中
        context.write(key, result); // 收集结果
    }
}

3. WordMain驱动类

MapReduce程序——WordCount（Windows_Eclipse + Ubuntu14.04_Hadoop2.9.0）

package wordCount;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;

public class WordMain {

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
　　　　　// 检查运行命令
        if (otherArgs.length != 2) {
            System.err.println("Usage: wordCount <in> <out>");
            System.exit(2);
        }
　　　　　// 配置作业名
        Job job = new Job(conf, "word count");
　　　　　// 配置作业的各个类
        job.setJarByClass(WordMain.class);
        job.setMapperClass(WordMapper.class);
        job.setCombinerClass(WordReducer.class);
        job.setReducerClass(WordReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
        FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

五、打包成JAR文件

右击项目名 → Export → Java → JAR file

看到如下界面：

除了红框的内容，其他选项默认

MapReduce程序——WordCount（Windows_Eclipse + Ubuntu14.04_Hadoop2.9.0）

点击Finish

JAR文件生成成功！

六、部署和运行

1. 把刚刚生成的JAR文件发送到Hadoop集群的Master节点的$HADOOP_HOME下面

2. 在Master节点的$HADOOP_HOME下面创建两个待统计词频的文件，file1.txt和file2.txt

file1.txt

Hello, I love coding
Are you OK?
Hello, I love hadoop
Are you OK?

file2.txt

Hello I love coding
Are you OK ?
Hello I love hadoop
Are you OK ?

3. 上传文件到HDFS系统中

$ hdfs dfs -put ./file* input

查看是否上传成功

$ hdfs dfs -ls input

4. 运行程序

$ hdfs dfs -rm -r output #如果HDFS系统中存在output目录
$ hadoop jar wordCount.jar wordCount.WordMain input/file* output

5. 查看运行结果

$ hdfs dfs -cat output/*

MapReduce程序——WordCount（Windows_Eclipse + Ubuntu14.04_Hadoop2.9.0）

以上

秒客网

MapReduce程序——WordCount（Windows_Eclipse + Ubuntu14.04_Hadoop2.9.0）

相关文章