Hadoop 2: Mapper and Reduce

Date: 2023-03-09 07:28:04
Understanding and Practicing Hadoop Mapper and Reduce

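1 The Mapper Phase

Hadoop divides the input data into fixed-size pieces called input splits (by default one HDFS block: 128 MB in Hadoop 2, 64 MB in Hadoop 1) and creates one Mapper task for each split. Each Mapper task runs the user-defined map function over the records in its split. The mapper's job is to pick out the data we are interested in and emit it as a set of key-value pairs. For example, to find the highest temperature for each month in the NCDC data (for the record format and field descriptions, see the official NCDC DATA Readme.txt), the input looks like this: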

STN--- WBAN   YEARMODA    TEMP       DEWP      SLP        STP       VISIB      WDSP     MXSPD   GUST    MAX     MIN
484310 99999 19720101 69.1 18 50.8 18 1034.4 17 1007.0 17 7.0 18 6.6 18 19.0 999.9 79.3* 60.3*
484310 99999 19720102 67.6 19 51.4 19 1032.3 19 1004.9 19 6.9 19 3.9 19 8.0 999.9 78.3* 58.3*
484310 99999 19720103 72.6 14 52.8 14 1032.0 14 1004.9 14 7.0 14 4.1 14 8.9 999.9 81.3* 62.4*
035623 99999 19720208 43.9 24 36.8 24 9999.9 0 9999.9 0 3.4 24 9.4 24 19.0 999.9 50.0* 37.4* 99.99 999.9 110000
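
As a rough illustration of what the mapper does with one record, the sketch below turns a single line into a (year-month, temperature) pair. It uses the same column offsets as the mapper in section 3 (characters 14 to 19 for the year and month, 25 to 29 for the temperature). The class name RecordParseSketch and the method toKeyValue are hypothetical, and the sample line is rebuilt with fixed-width padding, which is an assumption about the layout of the original file because the rows shown above have lost their spacing.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.Map;

public class RecordParseSketch {
    // Same missing-value marker the mapper in section 3 compares against
    static final double MISSING = 999.9;

    /**
     * Returns (yearMonth, temperature) for one record, or null when the line
     * is too short, is a header line, or carries the missing-value marker.
     */
    static Map.Entry<String, Double> toKeyValue(String line) {
        if (line.length() < 30) {
            return null;
        }
        String yearMonth = line.substring(14, 20);       // YYYYMM
        String tempStr = line.substring(25, 30).trim();  // temperature field
        double temp;
        try {
            temp = Double.parseDouble(tempStr);
        } catch (NumberFormatException e) {
            return null; // for example the header line, where the field reads "TEMP"
        }
        if (temp == MISSING) {
            return null;
        }
        return new SimpleEntry<>(yearMonth, temp);
    }

    public static void main(String[] args) {
        // Assumed fixed-width layout: station, WBAN, date, temperature
        String line = String.format("%-6s %-5s  %8s  %6s", "484310", "99999", "19720101", "69.1");
        System.out.println(toKeyValue(line)); // prints 197201=69.1
    }
}
```

In the real job the pair is handed to the OutputCollector instead of being returned, but the key and value are chosen in the same way.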




2 The Reduce Phase




3 Developing the MapReduce Job


package com.sywu.hadoop.mapper;

import java.io.IOException;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.JobContext;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

/**
 * Mapper for the highest-temperature analysis.
 * @author lanstonwu
 */
public class TemperatureMapper extends MapReduceBase implements Mapper<LongWritable, Text, Text, DoubleWritable> {
    static enum MyCounters {
        NUM_RECORDS
    }

    private final double MISSING = 999.9;
    private String mapTaskId;
    private String inputFile;
    private int noRecords = 0;

    // Read the job information
    public void configure(JobConf job) {
        mapTaskId = job.get(JobContext.TASK_ATTEMPT_ID);
        inputFile = job.get(JobContext.MAP_INPUT_FILE);
    }

    public void map(LongWritable offset, Text input, OutputCollector<Text, DoubleWritable> output, Reporter reporter)
            throws IOException {
        String line = input.toString(); // convert the input record to a String
        String yearStr = line.substring(14, 20), // year and month (YYYYMM)
               tempStr = line.substring(25, 30); // temperature field
        double maxTemp = 0;

        ++noRecords;
        // Increment counters
        reporter.incrCounter(MyCounters.NUM_RECORDS, 1);

        // Update the job status every 100 records
        if ((noRecords % 100) == 0) {
            reporter.setStatus(mapTaskId + " processed " + noRecords + " from input-file: " + inputFile);
        }

        // Continue only when the field is not text (for example the header line)
        if (!tempStr.matches("^([^A-Za-z]*?[A-Z][A-Za-z]*?)+.?")) {
            maxTemp = Double.parseDouble(tempStr);
            if (maxTemp != MISSING)
                output.collect(new Text(yearStr), new DoubleWritable(maxTemp));
        }
    }
}


package com.sywu.hadoop.reduce;

import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class TemperatureReduce extends MapReduceBase implements Reducer<Text, DoubleWritable, Text, DoubleWritable> {
    public void reduce(Text key, Iterator<DoubleWritable> values, OutputCollector<Text, DoubleWritable> output, Reporter reporter) throws IOException {
        double maxVal = 0;
        // Keep the largest value seen for this key
        while (values.hasNext()) {
            maxVal = Math.max(maxVal, values.next().get());
        }
        output.collect(key, new DoubleWritable(maxVal));
    }
}

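The TemperatureReduce class implements the reduce function of the Reducer interface, which takes four parameters. The first, key, is the key that the mapper emitted through its output. The second, values, iterates over the values that the mapper emitted for that key. The third, output, collects the results. The fourth, reporter, is used to report the status of the job.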
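
To make the key and values parameters more concrete, here is a minimal sketch in plain Java, with no Hadoop classes, of what happens between map and reduce: the mapper output is grouped by key, and each group is passed to a reduce step as an iterator. The names ReduceSketch, reduceMax and run are hypothetical, and the in-memory TreeMap is only a stand-in for the framework's shuffle and sort. The input pairs are taken from the sample records in section 1.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class ReduceSketch {
    /** Same shape as the reduce step: fold all values of one key into their maximum. */
    static double reduceMax(Iterator<Double> values) {
        // Starts below every possible reading so negative values are handled too
        double maxVal = Double.NEGATIVE_INFINITY;
        while (values.hasNext()) {
            maxVal = Math.max(maxVal, values.next());
        }
        return maxVal;
    }

    /** Stand-in for shuffle and sort: group pairs by key, then reduce each group. */
    static Map<String, Double> run(List<Map.Entry<String, Double>> mapOutput) {
        Map<String, List<Double>> groups = new TreeMap<>();
        for (Map.Entry<String, Double> kv : mapOutput) {
            groups.computeIfAbsent(kv.getKey(), k -> new ArrayList<>()).add(kv.getValue());
        }
        Map<String, Double> result = new TreeMap<>();
        for (Map.Entry<String, List<Double>> group : groups.entrySet()) {
            result.put(group.getKey(), reduceMax(group.getValue().iterator()));
        }
        return result;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Double>> mapOutput = new ArrayList<>();
        mapOutput.add(new SimpleEntry<>("197201", 69.1));
        mapOutput.add(new SimpleEntry<>("197201", 67.6));
        mapOutput.add(new SimpleEntry<>("197201", 72.6));
        mapOutput.add(new SimpleEntry<>("197202", 43.9));
        System.out.println(run(mapOutput)); // prints {197201=72.6, 197202=43.9}
    }
}
```

Each distinct key reaches the reduce step exactly once, which is why the job output contains one line per month.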

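package com.sywu.hadoop.main;

import java.io.IOException;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.io.Text;
import com.sywu.hadoop.mapper.TemperatureMapper;
import com.sywu.hadoop.reduce.TemperatureReduce;

public class TemperatureMain {
    public static void main(String[] args) {
        if (args.length != 2) {
            System.err.println("Invalid arguments! Usage: TemperatureMain <input path> <output path>");
            System.exit(-1);
        }
        JobConf jobConf = new JobConf();
        // Lets Hadoop locate the jar that contains the job classes
        jobConf.setJarByClass(TemperatureMain.class);
        // Set the input path
        FileInputFormat.addInputPath(jobConf, new Path(args[0]));
        // Set the output path
        FileOutputFormat.setOutputPath(jobConf, new Path(args[1]));
        // Set the mapper and reducer classes
        jobConf.setMapperClass(TemperatureMapper.class);
        jobConf.setReducerClass(TemperatureReduce.class);
        // Set the output key type
        jobConf.setOutputKeyClass(Text.class);
        // Set the output value type
        jobConf.setOutputValueClass(DoubleWritable.class);
        try {
            JobClient.runJob(jobConf);
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}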


4 Running the MapReduce Job


hadoop jar /tmp/myhadoop-1.0-SNAPSHOT.jar com.sywu.hadoop.main.TemperatureMain /ncdc_year_gz/gsod_1972.gz /tmp/result/02


17/10/02 18:44:00 INFO client.RMProxy: Connecting to ResourceManager at gp-sdw1/
17/10/02 18:44:01 INFO client.RMProxy: Connecting to ResourceManager at gp-sdw1/
17/10/02 18:44:02 WARN mapreduce.JobResourceUploader: Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.
17/10/02 18:44:03 INFO mapred.FileInputFormat: Total input paths to process : 1
17/10/02 18:44:04 INFO mapreduce.JobSubmitter: number of splits:1
17/10/02 18:44:05 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1506922345100_0016
17/10/02 18:44:06 INFO impl.YarnClientImpl: Submitted application application_1506922345100_0016
17/10/02 18:44:06 INFO mapreduce.Job: The url to track the job: http://gp-sdw1:8088/proxy/application_1506922345100_0016/
17/10/02 18:44:06 INFO mapreduce.Job: Running job: job_1506922345100_0016
17/10/02 18:44:31 INFO mapreduce.Job: Job job_1506922345100_0016 running in uber mode : false
17/10/02 18:44:31 INFO mapreduce.Job: map 0% reduce 0%
17/10/02 18:44:48 INFO mapreduce.Job: map 100% reduce 0%
17/10/02 18:45:04 INFO mapreduce.Job: map 100% reduce 100%
17/10/02 18:45:08 INFO mapreduce.Job: Job job_1506922345100_0016 completed successfully
17/10/02 18:45:09 INFO mapreduce.Job: Counters: 50
    File System Counters
        FILE: Number of bytes read=3420746
        FILE: Number of bytes written=7078435
        FILE: Number of read operations=0
        FILE: Number of large read operations=0
        FILE: Number of write operations=0
        HDFS: Number of bytes read=4556912
        HDFS: Number of bytes written=148
        HDFS: Number of read operations=6
        HDFS: Number of large read operations=0
        HDFS: Number of write operations=2
    Job Counters
        Launched map tasks=1
        Launched reduce tasks=1
        Data-local map tasks=1
        Total time spent by all maps in occupied slots (ms)=13755
        Total time spent by all reduces in occupied slots (ms)=13803
        Total time spent by all map tasks (ms)=13755
        Total time spent by all reduce tasks (ms)=13803
        Total vcore-milliseconds taken by all map tasks=13755
        Total vcore-milliseconds taken by all reduce tasks=13803
        Total megabyte-milliseconds taken by all map tasks=14085120
        Total megabyte-milliseconds taken by all reduce tasks=14134272
    Map-Reduce Framework
        Map input records=201807
        Map output records=201220
        Map output bytes=3018300
        Map output materialized bytes=3420746
        Input split bytes=97
        Combine input records=0
        Combine output records=0
        Reduce input groups=12
        Reduce shuffle bytes=3420746
        Reduce input records=201220
        Reduce output records=12
        Spilled Records=402440
        Shuffled Maps =1
        Failed Shuffles=0
        Merged Map outputs=1
        GC time elapsed (ms)=597
        CPU time spent (ms)=11880
        Physical memory (bytes) snapshot=453296128
        Virtual memory (bytes) snapshot=4201644032
        Total committed heap usage (bytes)=298319872
    Shuffle Errors
    File Input Format Counters
        Bytes Read=4556815
    File Output Format Counters
        Bytes Written=148

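The log shows the job ID, the input file information (Total input paths to process), the split information (number of splits), and the URL for tracking the job (The url to track the job). That URL shows how the job is progressing, and if the map and reduce functions set status through reporter, the live status messages can be seen there too. If the Hadoop historyserver is not enabled, this information and the URL become unavailable once the job finishes. The log also shows the map and reduce completion percentages and the Counters.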
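
The Counters are easier to interpret once a few of them are compared with each other. The sketch below is a small, hypothetical helper (CounterLogSketch and parseCounters are not part of Hadoop) that reads "name=value" lines copied from the client output and computes how many input records the mapper did not emit, which in this job corresponds to header lines and missing temperatures. The counter values are the ones from the log above.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class CounterLogSketch {
    /** Collects "name=value" counter lines from job client output into a map. */
    static Map<String, Long> parseCounters(String log) {
        Map<String, Long> counters = new LinkedHashMap<>();
        for (String raw : log.split("\\R")) {
            String line = raw.trim();
            int eq = line.lastIndexOf('=');
            if (eq <= 0) {
                continue; // group headers such as "Job Counters" have no value
            }
            try {
                String name = line.substring(0, eq).trim();
                long value = Long.parseLong(line.substring(eq + 1).trim());
                counters.put(name, value);
            } catch (NumberFormatException e) {
                // not a numeric counter line, skip it
            }
        }
        return counters;
    }

    public static void main(String[] args) {
        String log = String.join("\n",
                "Map-Reduce Framework",
                "Map input records=201807",
                "Map output records=201220",
                "Reduce input groups=12",
                "Reduce output records=12");
        Map<String, Long> c = parseCounters(log);
        long skipped = c.get("Map input records") - c.get("Map output records");
        System.out.println("records not emitted by the mapper: " + skipped); // prints 587
        System.out.println("distinct keys reduced: " + c.get("Reduce input groups")); // prints 12
    }
}
```

The 12 reduce input groups match the 12 months of 1972 that appear in the result file.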

5 Viewing the Results

$ hadoop fs -ls /tmp/result/02/
Found 2 items
-rw-r--r-- 3 hadoop supergroup 0 2017-10-02 18:45 /tmp/result/02/_SUCCESS
-rw-r--r-- 3 hadoop supergroup 148 2017-10-02 18:45 /tmp/result/02/part-00000


$ hadoop fs -cat /tmp/result/02/part-00000
197201 96.3
197202 99.1
197203 91.6
197204 94.2
197205 92.1
197206 102.4
197207 106.8
197208 107.0
197209 98.0
197210 94.1
197211 98.8
197212 102.6

6 Summary
