Running the WordCount Example from the Command Line on a Hadoop 2.7.1 Pseudo-Distributed Cluster

Posted: 2022-11-09 13:09:03
   My Hadoop version is 2.7.1 and my JDK is 1.7. As a newcomer, I imported the source of Hadoop's WordCount example into Eclipse on Windows today, only to hit a pile of errors at runtime; a real shame to lose so much time to them.
Hadoop 2.x really does differ a lot from 1.x. It is not just that the commands for starting HDFS have changed; many of the configuration files live in different directories too, which is genuinely frustrating.

I. Compile the WordCount source code in Eclipse and package it into a jar.

1. In Hadoop 2.7.1, the WordCount example ships in x:\hadoop-2.7.1\share\hadoop\mapreduce\hadoop-mapreduce-examples-2.7.1-sources.jar. Unpack that jar and you get the source of WordCount.java, shown below:

/**
* Licensed to the Apache Software Foundation (ASF) under one
* or more contributor license agreements. See the NOTICE file
* distributed with this work for additional information
* regarding copyright ownership. The ASF licenses this file
* to you under the Apache License, Version 2.0 (the
* "License"); you may not use this file except in compliance
* with the License. You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
package org.apache.hadoop.examples;

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;

public class WordCount {

  public static class TokenizerMapper
       extends Mapper<Object, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text value, Context context
                    ) throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  public static class IntSumReducer
       extends Reducer<Text,IntWritable,Text,IntWritable> {
    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values,
                       Context context
                       ) throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
    if (otherArgs.length < 2) {
      System.err.println("Usage: wordcount <in> [<in>...] <out>");
      System.exit(2);
    }
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    for (int i = 0; i < otherArgs.length - 1; ++i) {
      FileInputFormat.addInputPath(job, new Path(otherArgs[i]));
    }
    FileOutputFormat.setOutputPath(job,
        new Path(otherArgs[otherArgs.length - 1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

2. In Eclipse, create a new Java project named WordCount, and inside it create the package "org.apache.hadoop.examples" (or define a package name of your own). Copy the Hadoop source code above into it.



3. As the code above shows, several Hadoop classes are imported, so the corresponding dependency jars must be added to the project's build path in Eclipse; otherwise compilation fails because the compiler cannot find those classes.


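As an aside, the dependency problem disappears entirely if you compile on the Hadoop machine itself, because the "hadoop classpath" command prints the classpath of every jar the installation ships with. A minimal command-line sketch, assuming HADOOP_HOME points at the hadoop-2.7.1 install directory and WordCount.java sits in the current directory:

mkdir -p classes
# "hadoop classpath" prints the full dependency classpath of the local
# install; hand it straight to javac.
javac -classpath "$("$HADOOP_HOME"/bin/hadoop classpath)" -d classes WordCount.java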

4. Now compile and run it in Eclipse. It builds and runs successfully; since we passed no arguments, it prints the expected usage message:

Usage: wordcount <in> [<in>...] <out>

Next comes packaging it into a jar file:

5. Right-click the project in the Package Explorer on the left and choose Export, exporting it as a JAR file:

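If you would rather not click through the export wizard, the JDK's jar tool does the same job. A sketch, assuming Eclipse compiled the classes into the project's default bin folder (so bin contains the org/apache/hadoop/examples package tree):

# Bundle everything under bin/ into the jar, preserving the package directories.
jar -cvf WordCount.jar -C bin .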

6. The packaged jar file now shows up on the desktop:




7. I used WinSCP to upload it to the root directory of the CentOS machine:

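Plain scp works just as well as WinSCP. A sketch, in which the hostname centos-host and the /root/ destination are assumptions about your particular setup:

# Copy the jar to root's home directory on the CentOS machine.
scp WordCount.jar root@centos-host:/root/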

II. Run the WordCount example on Hadoop.

1. We sent WordCount.jar to our root directory; let's check that it is actually there. (OK, it is!)


2. Next, start the pseudo-distributed cluster that was set up when Hadoop was installed:

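In Hadoop 2.x the old start-all.sh script is deprecated in favour of per-subsystem scripts. A sketch of the usual startup, assuming HADOOP_HOME is set:

# Start the HDFS daemons: NameNode, DataNode, SecondaryNameNode.
"$HADOOP_HOME"/sbin/start-dfs.sh
# Optionally start YARN (ResourceManager, NodeManager). The job log further
# down shows LocalJobRunner, so this particular run did not go through YARN.
"$HADOOP_HOME"/sbin/start-yarn.sh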

3. Check with the jps command. (Note: Hadoop 2.x has no JobTracker or TaskTracker any more; MapReduce 2 runs on YARN, whose daemons are the ResourceManager and the NodeManagers. If YARN is not configured, jobs fall back to the LocalJobRunner, which is exactly what the log below shows.)

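Roughly what to expect from jps once HDFS (and, optionally, YARN) is up; the list below is indicative rather than a verbatim capture of my terminal:

jps
# Besides Jps itself, expect at least:
#   NameNode
#   DataNode
#   SecondaryNameNode
# and, only if YARN was started as well:
#   ResourceManager
#   NodeManager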

4. (1) Create a local folder named file under /root:


    (2) Enter the file directory, create file1.txt and file2.txt, and put some content in them:

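For reference, something along these lines will do; the file contents here are hypothetical stand-ins (any whitespace-separated words work):

cd /root/file
# Hypothetical sample contents.
echo "hello world" > file1.txt
echo "hello hadoop hello world" > file2.txt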

   (3) Create an input folder in HDFS, and upload the files under file into /input (the input directory):
hadoop fs -mkdir /input   (the / here is the root directory of HDFS);
hadoop fs -put /root/file/file*.txt /input   (upload the files under file into the input directory);
hadoop fs -ls /input   (check that the files really landed in /input);


   (4) Run the WordCount example. First move the local WordCount.jar from the root directory into /home/hadoop-2.7.1. Because we kept the source's own "package org.apache.hadoop.examples;" declaration in Eclipse, the class sits inside that package hierarchy in the jar, so the main class must be spelled out fully qualified, and the command becomes "hadoop jar WordCount.jar org.apache.hadoop.examples.WordCount /input /output". If you chose your own package name instead, qualify WordCount with that package; only if the file has no package statement at all can you run it with the bare class name: "hadoop jar WordCount.jar WordCount /input /output".

[root@localhost hadoop-2.7.1]# hadoop jar WordCount.jar org.apache.hadoop.examples.WordCount /input /output
15/07/31 09:07:49 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
15/07/31 09:07:51 INFO Configuration.deprecation: session.id is deprecated. Instead, use dfs.metrics.session-id
15/07/31 09:07:51 INFO jvm.JvmMetrics: Initializing JVM Metrics with processName=JobTracker, sessionId=
15/07/31 09:07:51 INFO input.FileInputFormat: Total input paths to process : 2
15/07/31 09:07:51 INFO mapreduce.JobSubmitter: number of splits:2
15/07/31 09:07:52 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_local999674240_0001
15/07/31 09:07:53 INFO mapreduce.Job: The url to track the job: http://localhost:8080/
15/07/31 09:07:53 INFO mapreduce.Job: Running job: job_local999674240_0001
15/07/31 09:07:53 INFO mapred.LocalJobRunner: OutputCommitter set in config null
15/07/31 09:07:53 INFO output.FileOutputCommitter: File Output Committer Algorithm version is 1
15/07/31 09:07:53 INFO mapred.LocalJobRunner: OutputCommitter is org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter
15/07/31 09:07:53 INFO mapred.LocalJobRunner: Waiting for map tasks
15/07/31 09:07:53 INFO mapred.LocalJobRunner: Starting task: attempt_local999674240_0001_m_000000_0
15/07/31 09:07:53 INFO output.FileOutputCommitter: File Output Committer Algorithm version is 1
15/07/31 09:07:53 INFO mapred.Task: Using ResourceCalculatorProcessTree : [ ]
15/07/31 09:07:53 INFO mapred.MapTask: Processing split: hdfs://localhost:9000/input/file2.txt:0+25
15/07/31 09:07:54 INFO mapred.MapTask: (EQUATOR) 0 kvi 26214396(104857584)
15/07/31 09:07:54 INFO mapred.MapTask: mapreduce.task.io.sort.mb: 100
15/07/31 09:07:54 INFO mapred.MapTask: soft limit at 83886080
15/07/31 09:07:54 INFO mapred.MapTask: bufstart = 0; bufvoid = 104857600
15/07/31 09:07:54 INFO mapred.MapTask: kvstart = 26214396; length = 6553600
15/07/31 09:07:54 INFO mapred.MapTask: Map output collector class = org.apache.hadoop.mapred.MapTask$MapOutputBuffer
15/07/31 09:07:54 INFO mapred.LocalJobRunner:
15/07/31 09:07:54 INFO mapred.MapTask: Starting flush of map output
15/07/31 09:07:54 INFO mapred.MapTask: Spilling map output
15/07/31 09:07:54 INFO mapred.MapTask: bufstart = 0; bufend = 45; bufvoid = 104857600
15/07/31 09:07:54 INFO mapred.MapTask: kvstart = 26214396(104857584); kvend = 26214380(104857520); length = 17/6553600
15/07/31 09:07:54 INFO mapred.MapTask: Finished spill 0
15/07/31 09:07:54 INFO mapred.Task: Task:attempt_local999674240_0001_m_000000_0 is done. And is in the process of committing
15/07/31 09:07:54 INFO mapred.LocalJobRunner: map
15/07/31 09:07:54 INFO mapred.Task: Task 'attempt_local999674240_0001_m_000000_0' done.
15/07/31 09:07:54 INFO mapred.LocalJobRunner: Finishing task: attempt_local999674240_0001_m_000000_0
15/07/31 09:07:54 INFO mapred.LocalJobRunner: Starting task: attempt_local999674240_0001_m_000001_0
15/07/31 09:07:54 INFO output.FileOutputCommitter: File Output Committer Algorithm version is 1
15/07/31 09:07:54 INFO mapred.Task: Using ResourceCalculatorProcessTree : [ ]
15/07/31 09:07:54 INFO mapred.MapTask: Processing split: hdfs://localhost:9000/input/file1.txt:0+12
15/07/31 09:07:54 INFO mapred.MapTask: (EQUATOR) 0 kvi 26214396(104857584)
15/07/31 09:07:54 INFO mapred.MapTask: mapreduce.task.io.sort.mb: 100
15/07/31 09:07:54 INFO mapred.MapTask: soft limit at 83886080
15/07/31 09:07:54 INFO mapred.MapTask: bufstart = 0; bufvoid = 104857600
15/07/31 09:07:54 INFO mapred.MapTask: kvstart = 26214396; length = 6553600
15/07/31 09:07:54 INFO mapred.MapTask: Map output collector class = org.apache.hadoop.mapred.MapTask$MapOutputBuffer
15/07/31 09:07:54 INFO mapred.LocalJobRunner:
15/07/31 09:07:54 INFO mapred.MapTask: Starting flush of map output
15/07/31 09:07:54 INFO mapred.MapTask: Spilling map output
15/07/31 09:07:54 INFO mapred.MapTask: bufstart = 0; bufend = 20; bufvoid = 104857600
15/07/31 09:07:54 INFO mapred.MapTask: kvstart = 26214396(104857584); kvend = 26214392(104857568); length = 5/6553600
15/07/31 09:07:54 INFO mapred.MapTask: Finished spill 0
15/07/31 09:07:54 INFO mapred.Task: Task:attempt_local999674240_0001_m_000001_0 is done. And is in the process of committing
15/07/31 09:07:54 INFO mapred.LocalJobRunner: map
15/07/31 09:07:54 INFO mapred.Task: Task 'attempt_local999674240_0001_m_000001_0' done.
15/07/31 09:07:54 INFO mapred.LocalJobRunner: Finishing task: attempt_local999674240_0001_m_000001_0
15/07/31 09:07:54 INFO mapred.LocalJobRunner: map task executor complete.
15/07/31 09:07:54 INFO mapred.LocalJobRunner: Waiting for reduce tasks
15/07/31 09:07:54 INFO mapred.LocalJobRunner: Starting task: attempt_local999674240_0001_r_000000_0
15/07/31 09:07:54 INFO output.FileOutputCommitter: File Output Committer Algorithm version is 1
15/07/31 09:07:54 INFO mapred.Task: Using ResourceCalculatorProcessTree : [ ]
15/07/31 09:07:54 INFO mapreduce.Job: Job job_local999674240_0001 running in uber mode : false
15/07/31 09:07:54 INFO mapreduce.Job: map 100% reduce 0%
15/07/31 09:07:54 INFO mapred.ReduceTask: Using ShuffleConsumerPlugin: org.apache.hadoop.mapreduce.task.reduce.Shuffle@101cd25f
15/07/31 09:07:54 INFO reduce.MergeManagerImpl: MergerManager: memoryLimit=363285696, maxSingleShuffleLimit=90821424, mergeThreshold=239768576, ioSortFactor=10, memToMemMergeOutputsThreshold=10
15/07/31 09:07:54 INFO reduce.EventFetcher: attempt_local999674240_0001_r_000000_0 Thread started: EventFetcher for fetching Map Completion Events
15/07/31 09:07:54 INFO reduce.LocalFetcher: localfetcher#1 about to shuffle output of map attempt_local999674240_0001_m_000001_0 decomp: 26 len: 30 to MEMORY
15/07/31 09:07:54 INFO reduce.InMemoryMapOutput: Read 26 bytes from map-output for attempt_local999674240_0001_m_000001_0
15/07/31 09:07:54 INFO reduce.MergeManagerImpl: closeInMemoryFile -> map-output of size: 26, inMemoryMapOutputs.size() -> 1, commitMemory -> 0, usedMemory ->26
15/07/31 09:07:54 INFO reduce.LocalFetcher: localfetcher#1 about to shuffle output of map attempt_local999674240_0001_m_000000_0 decomp: 57 len: 61 to MEMORY
15/07/31 09:07:54 INFO reduce.InMemoryMapOutput: Read 57 bytes from map-output for attempt_local999674240_0001_m_000000_0
15/07/31 09:07:54 INFO reduce.MergeManagerImpl: closeInMemoryFile -> map-output of size: 57, inMemoryMapOutputs.size() -> 2, commitMemory -> 26, usedMemory ->83
15/07/31 09:07:54 INFO reduce.EventFetcher: EventFetcher is interrupted.. Returning
15/07/31 09:07:54 INFO mapred.LocalJobRunner: 2 / 2 copied.
15/07/31 09:07:54 INFO reduce.MergeManagerImpl: finalMerge called with 2 in-memory map-outputs and 0 on-disk map-outputs
15/07/31 09:07:54 INFO mapred.Merger: Merging 2 sorted segments
15/07/31 09:07:54 INFO mapred.Merger: Down to the last merge-pass, with 2 segments left of total size: 68 bytes
15/07/31 09:07:54 INFO reduce.MergeManagerImpl: Merged 2 segments, 83 bytes to disk to satisfy reduce memory limit
15/07/31 09:07:54 INFO reduce.MergeManagerImpl: Merging 1 files, 85 bytes from disk
15/07/31 09:07:54 INFO reduce.MergeManagerImpl: Merging 0 segments, 0 bytes from memory into reduce
15/07/31 09:07:54 INFO mapred.Merger: Merging 1 sorted segments
15/07/31 09:07:54 INFO mapred.Merger: Down to the last merge-pass, with 1 segments left of total size: 73 bytes
15/07/31 09:07:54 INFO mapred.LocalJobRunner: 2 / 2 copied.
15/07/31 09:07:54 INFO Configuration.deprecation: mapred.skip.on is deprecated. Instead, use mapreduce.job.skiprecords
15/07/31 09:07:55 INFO mapred.Task: Task:attempt_local999674240_0001_r_000000_0 is done. And is in the process of committing
15/07/31 09:07:55 INFO mapred.LocalJobRunner: 2 / 2 copied.
15/07/31 09:07:55 INFO mapred.Task: Task attempt_local999674240_0001_r_000000_0 is allowed to commit now
15/07/31 09:07:55 INFO output.FileOutputCommitter: Saved output of task 'attempt_local999674240_0001_r_000000_0' to hdfs://localhost:9000/output/_temporary/0/task_local999674240_0001_r_000000
15/07/31 09:07:55 INFO mapred.LocalJobRunner: reduce > reduce
15/07/31 09:07:55 INFO mapred.Task: Task 'attempt_local999674240_0001_r_000000_0' done.
15/07/31 09:07:55 INFO mapred.LocalJobRunner: Finishing task: attempt_local999674240_0001_r_000000_0
15/07/31 09:07:55 INFO mapred.LocalJobRunner: reduce task executor complete.
15/07/31 09:07:55 INFO mapreduce.Job: map 100% reduce 100%
15/07/31 09:07:56 INFO mapreduce.Job: Job job_local999674240_0001 completed successfully
15/07/31 09:07:56 INFO mapreduce.Job: Counters: 35
File System Counters
FILE: Number of bytes read=821810
FILE: Number of bytes written=1644944
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=99
HDFS: Number of bytes written=51
HDFS: Number of read operations=22
HDFS: Number of large read operations=0
HDFS: Number of write operations=5
Map-Reduce Framework
Map input records=2
Map output records=7
Map output bytes=65
Map output materialized bytes=91
Input split bytes=204
Combine input records=7
Combine output records=7
Reduce input groups=7
Reduce shuffle bytes=91
Reduce input records=7
Reduce output records=7
Spilled Records=14
Shuffled Maps =2
Failed Shuffles=0
Merged Map outputs=2
GC time elapsed (ms)=85
Total committed heap usage (bytes)=456732672
Shuffle Errors
BAD_ID=0
CONNECTION=0
IO_ERROR=0
WRONG_LENGTH=0
WRONG_MAP=0
WRONG_REDUCE=0
File Input Format Counters
Bytes Read=37
File Output Format Counters
Bytes Written=51

   (5) After the job succeeds, inspect /output. (Do not create the output directory beforehand; MapReduce creates it itself during the run, and in fact the job fails if it already exists.)


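Listing the output directory, for instance:

hadoop fs -ls /output
# Expect two entries: _SUCCESS (an empty marker file) and
# part-r-00000 (the reducer's results).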

   (6) View and print the result (it is written to "part-r-00000"):

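To dump the counts to the terminal:

# Print the word counts computed by the job.
hadoop fs -cat /output/part-r-00000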

   OK, that wraps up the WordCount example. Now to get properly familiar with the HDFS file-manipulation commands...