Hadoop之HelloWorld

时间:2023-03-09 17:43:00
Hadoop之HelloWorld

Hadoop开始:

1. 下载最新的发行版,解压到你喜欢的路径。

2. 配置,Hadoop的配置文件位于~/hadoop/conf/ 目录下。这里我先只配置了core-site.xml文件。

 <?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?> <!-- Put site-specific property overrides in this file. --> <configuration>
<property>
<name>fs.default.name</name>
<value>hdfs://localhost:9000</value>
</property>
<property>
<name>hadoop.tmp.dir</name>
<value>/home/Jack/dfs</value>
</property>
</configuration>

上面我指定了hadoop的DFS文件系统的路径。

3. 格式化DFS系统,输入命令: > ./hadoop namenode -format

4. 启动Hadoop,输入命令: > ./start-all.sh

**到这里Hadoop的启动已经正常,可以在端口50070和50030查看集群的状态。

======================================================================

第一个程序:HadoopHelloWorld

import java.io.IOException;
import java.util.*;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*; public class HadoopHelloWorld { public static class Map extends MapReduceBase implements Mapper<LongWritable,Text,Text,IntWritable> {
private final static IntWritable one=new IntWritable(1);
private Text word=new Text(); public void map(LongWritable key, Text value, OutputCollector<Text,IntWritable> output, Reporter reporter)
throws IOException {
String line= value.toString();
StringTokenizer tokenizer=new StringTokenizer(line);
while(tokenizer.hasMoreTokens()) {
word.set(tokenizer.nextToken());
output.collect(word, one);
}
}
} public static class Reduce extends MapReduceBase implements Reducer<Text,IntWritable,Text,IntWritable> {
public void reduce(Text key,Iterator<IntWritable> values,OutputCollector<Text,IntWritable>output, Reporter reporter)
throws IOException{
int sum=0;
while(values.hasNext()) {
sum+=values.next().get();
}
output.collect(key, new IntWritable(sum)); }
} public static void main(String args[]) throws Exception {
JobConf conf=new JobConf(HadoopHelloWorld.class);
conf.setJobName("wordcount"); conf.setOutputKeyClass(Text.class);
conf.setOutputValueClass(IntWritable.class); conf.setMapperClass(Map.class);
conf.setReducerClass(Reduce.class); conf.setInputFormat(TextInputFormat.class);
conf.setOutputFormat(TextOutputFormat.class); FileInputFormat.setInputPaths(conf, new Path(args[0]));
FileOutputFormat.setOutputPath(conf, new Path(args[1])); JobClient.runJob(conf);
} }

HadoopHelloWorld

需要引入的基础包:

JRE system Library

Hadoop-core.jar

commons-logging.jar

说明一下,别的文档中没有将需要commons-logging.jar 这个包,可以我的没有这个包一直报错。java.lang.NoClassDefFoundError: org/apache/commons/logging/LogFactory

以上工作做好了之后,编译HadoopHelloWorld.java文件就好,将生成的class文件放入文件夹~/source/java2013/HadoopHelloWorld/,然后打成一个jar包。

[Jack@win bin]$ jar -cvf HadoopHelloWorld.jar -C ~/source/java2013/HadoopHelloWorld/ .

上传2个input文件作为程序输入[ file01,file02 ]。

[Jack@win bin]$./ hadoop fs -mkdir input

[Jack@win bin]$ ./hadoop dfs -put ~/source/java2012/FirstJar/input/file* input

运行程序:

[Jack@win bin]$./hadoop jar HadoopHelloWorld.jar HadoopHelloWorld input output

13/06/20 03:16:44 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
13/06/20 03:16:45 INFO util.NativeCodeLoader: Loaded the native-hadoop library
13/06/20 03:16:45 WARN snappy.LoadSnappy: Snappy native library not loaded
13/06/20 03:16:45 INFO mapred.FileInputFormat: Total input paths to process : 4
13/06/20 03:16:45 INFO mapred.JobClient: Running job: job_201306200226_0002
13/06/20 03:16:46 INFO mapred.JobClient: map 0% reduce 0%
13/06/20 03:16:59 INFO mapred.JobClient: map 40% reduce 0%
13/06/20 03:17:05 INFO mapred.JobClient: map 80% reduce 0%
13/06/20 03:17:08 INFO mapred.JobClient: map 80% reduce 26%
13/06/20 03:17:11 INFO mapred.JobClient: map 100% reduce 26%
13/06/20 03:17:23 INFO mapred.JobClient: map 100% reduce 100%
13/06/20 03:17:28 INFO mapred.JobClient: Job complete: job_201306200226_0002
13/06/20 03:17:28 INFO mapred.JobClient: Counters: 30
13/06/20 03:17:28 INFO mapred.JobClient: Job Counters
13/06/20 03:17:28 INFO mapred.JobClient: Launched reduce tasks=1
13/06/20 03:17:28 INFO mapred.JobClient: SLOTS_MILLIS_MAPS=32074
13/06/20 03:17:28 INFO mapred.JobClient: Total time spent by all reduces waiting after reserving slots (ms)=0
13/06/20 03:17:28 INFO mapred.JobClient: Total time spent by all maps waiting after reserving slots (ms)=0
13/06/20 03:17:28 INFO mapred.JobClient: Launched map tasks=5
13/06/20 03:17:28 INFO mapred.JobClient: Data-local map tasks=3
13/06/20 03:17:28 INFO mapred.JobClient: SLOTS_MILLIS_REDUCES=23534
13/06/20 03:17:28 INFO mapred.JobClient: File Input Format Counters
13/06/20 03:17:28 INFO mapred.JobClient: Bytes Read=54
13/06/20 03:17:28 INFO mapred.JobClient: File Output Format Counters
13/06/20 03:17:28 INFO mapred.JobClient: Bytes Written=41
13/06/20 03:17:28 INFO mapred.JobClient: FileSystemCounters
13/06/20 03:17:28 INFO mapred.JobClient: FILE_BYTES_READ=104
13/06/20 03:17:28 INFO mapred.JobClient: HDFS_BYTES_READ=541
13/06/20 03:17:28 INFO mapred.JobClient: FILE_BYTES_WRITTEN=128481
13/06/20 03:17:28 INFO mapred.JobClient: HDFS_BYTES_WRITTEN=41
13/06/20 03:17:28 INFO mapred.JobClient: Map-Reduce Framework
13/06/20 03:17:28 INFO mapred.JobClient: Map output materialized bytes=128
13/06/20 03:17:28 INFO mapred.JobClient: Map input records=2
13/06/20 03:17:28 INFO mapred.JobClient: Reduce shuffle bytes=122
13/06/20 03:17:28 INFO mapred.JobClient: Spilled Records=16
13/06/20 03:17:28 INFO mapred.JobClient: Map output bytes=82
13/06/20 03:17:28 INFO mapred.JobClient: Total committed heap usage (bytes)=912719872
13/06/20 03:17:28 INFO mapred.JobClient: CPU time spent (ms)=5190
13/06/20 03:17:28 INFO mapred.JobClient: Map input bytes=50
13/06/20 03:17:28 INFO mapred.JobClient: SPLIT_RAW_BYTES=487
13/06/20 03:17:28 INFO mapred.JobClient: Combine input records=0
13/06/20 03:17:28 INFO mapred.JobClient: Reduce input records=8
13/06/20 03:17:28 INFO mapred.JobClient: Reduce input groups=5
13/06/20 03:17:28 INFO mapred.JobClient: Combine output records=0
13/06/20 03:17:28 INFO mapred.JobClient: Physical memory (bytes) snapshot=932745216
13/06/20 03:17:28 INFO mapred.JobClient: Reduce output records=5
13/06/20 03:17:28 INFO mapred.JobClient: Virtual memory (bytes) snapshot=2390478848
13/06/20 03:17:28 INFO mapred.JobClient: Map output records=8

Result