Configuring a Hadoop Development Environment in IDEA & Compiling and Running the WordCount Program

For installing and configuring Hadoop and Java, see: https://www.cnblogs.com/lxc1910/p/11734477.html

1. Create a new Java project:

Select a suitable JDK.

Name the project WordCount.

2. Add the WordCount class file:

Add a new Java class file under src, named WordCount, with the following code:

 import java.io.IOException;
 import java.util.StringTokenizer;

 import org.apache.hadoop.conf.Configuration;
 import org.apache.hadoop.fs.Path;
 import org.apache.hadoop.io.IntWritable;
 import org.apache.hadoop.io.Text;
 import org.apache.hadoop.mapreduce.Job;
 import org.apache.hadoop.mapreduce.Mapper;
 import org.apache.hadoop.mapreduce.Reducer;
 import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
 import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
 import org.apache.hadoop.util.GenericOptionsParser;

 public class WordCount {

     // Mapper: splits each input line into words
     public static class TokenizerMapper
             extends Mapper<Object, Text, Text, IntWritable> {

         private final static IntWritable one = new IntWritable(1);
         private Text word = new Text();

         // map(): emit <word, 1> for every token in the line
         @Override
         public void map(Object key, Text value, Context context)
                 throws IOException, InterruptedException {
             // Split the line into whitespace-separated words
             StringTokenizer itr = new StringTokenizer(value.toString());
             while (itr.hasMoreTokens()) {
                 word.set(itr.nextToken()); // store the current word in the reusable Text object
                 context.write(word, one);  // emit <key, value>
             }
         }
     }

     // Reducer: combines all values that share the same key
     public static class IntSumReducer
             extends Reducer<Text, IntWritable, Text, IntWritable> {

         private IntWritable result = new IntWritable();

         // reduce(): sum the counts for one word
         @Override
         public void reduce(Text key, Iterable<IntWritable> values, Context context)
                 throws IOException, InterruptedException {
             int sum = 0;
             // Iterate over every value recorded for this key
             for (IntWritable val : values) {
                 sum += val.get();
             }
             result.set(sum);
             // Emit the output pair <key, value>
             context.write(key, result);
         }
     }

     public static void main(String[] args) throws Exception {
         // Configuration for the job
         Configuration conf = new Configuration();
         // Command-line arguments left over after the generic Hadoop options
         String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
         if (otherArgs.length != 2) {
             System.err.println("Usage: wordcount <in> <out>");
             System.exit(2);
         }
         Job job = Job.getInstance(conf, "word count"); // create a user-defined Job
         job.setJarByClass(WordCount.class);            // JAR that contains the job classes
         job.setMapperClass(TokenizerMapper.class);     // Mapper class
         job.setCombinerClass(IntSumReducer.class);     // Combiner class
         job.setReducerClass(IntSumReducer.class);      // Reducer class
         job.setOutputKeyClass(Text.class);             // type of the job's output key
         job.setOutputValueClass(IntWritable.class);    // type of the job's output value
         // Input path
         FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
         // Output path
         FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
         // Submit the job and wait for it to finish
         System.exit(job.waitForCompletion(true) ? 0 : 1);
     }
 }
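To see what the job computes without a cluster, the map, combine and reduce steps can be imitated in plain Java. This is only a rough sketch under my own assumptions: the class LocalWordCount and its methods are hypothetical names, each inner list stands in for one input split, and none of the Hadoop machinery (shuffle, sorting, Writable types) is modelled. It does show why IntSumReducer can also serve as the combiner: summing partial sums gives the same totals as summing all the ones directly.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

// Hypothetical stand-alone imitation of the WordCount data flow (no Hadoop needed).
public class LocalWordCount {

    // Map step: split a line on whitespace, as TokenizerMapper does,
    // and emit one (word, 1) pair per token.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        StringTokenizer itr = new StringTokenizer(line);
        while (itr.hasMoreTokens()) {
            pairs.add(new AbstractMap.SimpleEntry<>(itr.nextToken(), 1));
        }
        return pairs;
    }

    // Combine/reduce step: sum the values that share a key, as IntSumReducer does.
    static Map<String, Integer> sumByKey(List<Map.Entry<String, Integer>> pairs) {
        Map<String, Integer> sums = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            sums.merge(pair.getKey(), pair.getValue(), Integer::sum);
        }
        return sums;
    }

    // Whole pipeline: map and combine each split locally, then reduce the partial sums.
    static Map<String, Integer> wordCount(List<List<String>> splits) {
        List<Map.Entry<String, Integer>> partialSums = new ArrayList<>();
        for (List<String> split : splits) {
            List<Map.Entry<String, Integer>> mapped = new ArrayList<>();
            for (String line : split) {
                mapped.addAll(map(line));
            }
            // Combiner: pre-aggregate this split's pairs before the final reduce
            partialSums.addAll(sumByKey(mapped).entrySet());
        }
        return sumByKey(partialSums);
    }

    public static void main(String[] args) {
        List<List<String>> splits = Arrays.asList(
                Arrays.asList("hello world", "hello hadoop"),
                Arrays.asList("hello idea"));
        System.out.println(wordCount(splits)); // prints {hadoop=1, hello=3, idea=1, world=1}
    }
}
```

The TreeMap is used only to get a stable, sorted printout; it is not meant to mirror how Hadoop orders keys internally.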

3. Add the dependency libraries:

Open File -> Project Structure -> Modules, select Dependencies, click the plus sign, and add the following dependency libraries:

[Figure: list of dependency libraries]

4. Build the JAR:

Open File -> Project Structure -> Artifacts, click the plus sign -> JAR -> From modules with dependencies, and select the WordCount class as the Main Class.

Now build the JAR:

Click Build -> Build Artifacts -> Build. When the build finishes, a new directory named output appears.

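5. Run the JAR on Hadoop:

I had previously installed a pseudo-distributed Hadoop under the hadoop user, so first copy the JAR into the hadoop user's home directory.

Start the Hadoop services (from the sbin folder of the Hadoop installation directory):

 ./start-all.sh

Create a test-in folder in HDFS and put the two files file1.txt and file2.txt into it:

 hadoop fs -mkdir test-in

 hadoop fs -put file1.txt file2.txt test-in/

Run the JAR:

 hadoop jar WordCount.jar test-in test-out

Because the main class was set when the JAR was built, there is no need to add WordCount after WordCount.jar.

Also note that HDFS must not already contain a test-out folder before the JAR is run; otherwise the job fails because the output directory already exists.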
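On the main-class note above: as far as I understand, hadoop jar looks at the Main-Class attribute of the JAR manifest and only expects a class name on the command line when that attribute is missing. The sketch below uses nothing but the standard java.util.jar API to write and read back that attribute. The class ManifestMainClass and its two helper methods are hypothetical names of mine, and the JAR is built in memory rather than by IDEA.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.jar.Attributes;
import java.util.jar.JarInputStream;
import java.util.jar.JarOutputStream;
import java.util.jar.Manifest;

// Hypothetical helper showing where a JAR records its main class.
public class ManifestMainClass {

    // Build a tiny in-memory JAR whose manifest names the given main class.
    static byte[] jarWithMainClass(String mainClass) {
        Manifest manifest = new Manifest();
        Attributes attrs = manifest.getMainAttributes();
        // Manifest-Version has to be present, or the main attributes are not written
        attrs.put(Attributes.Name.MANIFEST_VERSION, "1.0");
        attrs.put(Attributes.Name.MAIN_CLASS, mainClass);
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (JarOutputStream jar = new JarOutputStream(bytes, manifest)) {
            // No class files are needed for this demonstration
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    // Read the Main-Class attribute back; returns null if the JAR has none.
    static String readMainClass(byte[] jarBytes) {
        try (JarInputStream jar = new JarInputStream(new ByteArrayInputStream(jarBytes))) {
            Manifest manifest = jar.getManifest();
            if (manifest == null) {
                return null;
            }
            return manifest.getMainAttributes().getValue(Attributes.Name.MAIN_CLASS);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] jar = jarWithMainClass("WordCount");
        System.out.println(readMainClass(jar)); // prints WordCount
    }
}
```

If the artifact had been built without a Main Class, the class name would presumably have to follow the JAR name, as in hadoop jar WordCount.jar WordCount test-in test-out.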

6. View the results

The state of the Hadoop system can be checked at http://localhost:50070/ (the NameNode web UI port on this installation; Hadoop 3.x defaults to 9870 instead).

Click Utilities -> Browse the file system to browse HDFS.

The output files appear under the test-out folder. Print their contents with:

 hadoop fs -cat test-out/part-r-
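With the default text output format, each line of a part file holds a word and its count separated by a tab. If the counts are needed in another program, a small parser along these lines may help. It is only a sketch: the class PartFileParser is a hypothetical name, and the sample string is made-up data rather than the output of the run above.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical parser for "word<TAB>count" lines as written by the WordCount job.
public class PartFileParser {

    // Turn the text of a part file into a map, keeping the order of the lines.
    static Map<String, Integer> parse(String contents) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String line : contents.split("\n")) {
            if (line.isEmpty()) {
                continue; // skip blank lines
            }
            int tab = line.lastIndexOf('\t');
            if (tab < 0) {
                throw new IllegalArgumentException("expected word<TAB>count: " + line);
            }
            String word = line.substring(0, tab);
            int count = Integer.parseInt(line.substring(tab + 1).trim());
            counts.put(word, count);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Made-up sample in the same shape as the job's output
        String sample = "hadoop\t1\nhello\t3\nworld\t1\n";
        System.out.println(parse(sample)); // prints {hadoop=1, hello=3, world=1}
    }
}
```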

