Big Data: Map Completion and Spill File Merging

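When the Mapper runs out of input, the call to context.nextKeyValue in the while loop of mapper.run returns false and control returns to runNewMapper, which then closes the input and output channels. Closing the output channel is not just a matter of shutting down the collector: it has to flush first.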

Call chain:

MapTask.runNewMapper->NewOutputCollector.close->MapOutputBuffer.flush

Let's look at what flush does for us and why it is needed.

public void flush() throws IOException, ClassNotFoundException,

InterruptedException {

LOG.info("Starting flush of map output");

spillLock.lock();

try {

while (spillInProgress) {

reporter.progress();

spillDone.await();

// Check spillInProgress: while a spill is running, report progress and wait for it to finish.

}

checkSpillException();

final int kvbend = 4 * kvend;

// kvend marks the end of the metadata block; metadata grows downward.

// kvend is an array index counted in ints, kvbend is the same position counted in bytes.

if ((kvbend + METASIZE) % kvbuffer.length !=

equator - (equator % METASIZE)) {

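// This condition means the buffer held data earlier and a spill has now finished, so the space it used must be released.

// spill finished

// Each completed spill requires a few fields to be adjusted to free the space; resetSpill does that (its source is shown inline below).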

resetSpill();

private void resetSpill() {

final int e = equator;

bufstart = bufend = e;

final int aligned = e - (e % METASIZE);

// set start/end to point to first meta record

// Cast one of the operands to long to avoid integer overflow

kvstart = kvend = (int)

(((long)aligned - METASIZE + kvbuffer.length) % kvbuffer.length) / 4;

LOG.info("(RESET) equator " + e + " kv " + kvstart + "(" +

(kvstart * 4) + ")" + " kvi " + kvindex + "(" + (kvindex * 4) + ")");

}

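// resetSpill simply repositions the bookkeeping fields: bufstart and bufend move to the equator, and kvstart and kvend move to the first metadata slot below it.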

}

if (kvindex != kvend) {

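// Check whether the buffer still holds data. If it does, the spill threshold (80%) was never reached, but the map has finished and no more input will arrive.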

kvend = (kvindex + NMETA) % kvmeta.capacity();

bufend = bufmark;

LOG.info("Spilling map output");

LOG.info("bufstart = " + bufstart + "; bufend = " + bufmark +

"; bufvoid = " + bufvoid);

LOG.info("kvstart = " + kvstart + "(" + (kvstart * 4) +

"); kvend = " + kvend + "(" + (kvend * 4) +

"); length = " + (distanceTo(kvend, kvstart,

kvmeta.capacity()) + 1) + "/" + maxRec);

sortAndSpill();

// Run sortAndSpill one last time.

}

} catch (InterruptedException e) {

throw new IOException("Interrupted while waiting for the writer", e);

} finally {

spillLock.unlock();

}

// At this point the buffer is empty and all data has been spilled to files

assert !spillLock.isHeldByCurrentThread();

// shut down spill thread and wait for it to exit. Since the preceding

// ensures that it is finished with its work (and sortAndSpill did not

// throw), we elect to use an interrupt instead of setting a flag.

// Spilling simultaneously from this thread while the spill thread

// finishes its work might be both a useful way to extend this and also

// sufficient motivation for the latter approach.

try {

spillThread.interrupt();

// Interrupt the spill thread so that it stops running

spillThread.join();

// Wait for the spill thread to exit

} catch (InterruptedException e) {

throw new IOException("Spill failed", e);

}

// release sort buffer before the merge

kvbuffer = null;

mergeParts();

// Merge the spill files

Path outputPath = mapOutputFile.getOutputFile();

fileOutputByteCounter.increment(rfs.getFileStatus(outputPath).getLen());

}

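flush has two goals. First, every KV pair still in the buffer must reach a spill file. Second, because each spill produces its own spill file, there may be more than one of them, so they have to be merged into a single file that can be served to the reducers.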

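So if a spill is in progress, flush must wait for it to finish; if no spill is running but the buffer is not empty, one more sortAndSpill is needed. Either way, the buffer has to be drained. Once all data has been spilled, mergeParts can run.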
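The wait-then-drain logic of flush can be illustrated with a small stand-alone model. This is only a sketch under simplifying assumptions, not Hadoop code: the class FlushSketch, its methods, and the use of string lists in place of kvbuffer and spill files are all hypothetical. Only the lock, the condition, and the spillInProgress flag mirror names from the source above.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.ReentrantLock;

/** Toy model of the flush protocol. All names are hypothetical, not Hadoop code. */
final class FlushSketch {
    private final ReentrantLock spillLock = new ReentrantLock();
    private final Condition spillDone = spillLock.newCondition();
    private boolean spillInProgress = false;
    private List<String> buffer = new ArrayList<>();              // stands in for kvbuffer
    private final List<List<String>> spills = new ArrayList<>();  // stands in for spill files

    /** Adds one record to the in-memory buffer. */
    void collect(String record) {
        spillLock.lock();
        try {
            buffer.add(record);
        } finally {
            spillLock.unlock();
        }
    }

    /** Hands the current buffer to a background thread, like a threshold-triggered spill. */
    Thread startBackgroundSpill() {
        final List<String> snapshot;
        spillLock.lock();
        try {
            while (spillInProgress) {
                spillDone.awaitUninterruptibly();  // allow only one background spill at a time
            }
            snapshot = buffer;
            buffer = new ArrayList<>();
            spillInProgress = true;
        } finally {
            spillLock.unlock();
        }
        Thread t = new Thread(() -> {
            Collections.sort(snapshot);            // sort outside the lock
            spillLock.lock();
            try {
                spills.add(snapshot);
                spillInProgress = false;
                spillDone.signalAll();             // wake up a waiting flush
            } finally {
                spillLock.unlock();
            }
        });
        t.start();
        return t;
    }

    /** Waits for a running spill, spills whatever is left, and returns all spills. */
    List<List<String>> flush() {
        spillLock.lock();
        try {
            while (spillInProgress) {
                spillDone.await();
            }
            if (!buffer.isEmpty()) {               // threshold never reached: final spill
                Collections.sort(buffer);
                spills.add(buffer);
                buffer = new ArrayList<>();
            }
            return new ArrayList<>(spills);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted while waiting for the spill", e);
        } finally {
            spillLock.unlock();
        }
    }

    public static void main(String[] args) {
        FlushSketch sketch = new FlushSketch();
        sketch.collect("b");
        sketch.collect("a");
        sketch.startBackgroundSpill();
        sketch.collect("d");
        sketch.collect("c");
        System.out.println(sketch.flush());
    }
}
```

The background spill publishes its result and clears the flag under the same lock that flush holds while checking the flag, so flush can never observe a half-finished spill. In the real flush, the resulting spill files would then be handed to mergeParts.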

Call chain:

MapTask.runNewMapper--->NewOutputCollector.close--->MapOutputBuffer.flush--->MapOutputBuffer.mergeParts

The source code is as follows:

private void mergeParts() throws IOException, InterruptedException, ClassNotFoundException {

// get the approximate size of the final output/index files

long finalOutFileSize = 0;

long finalIndexFileSize = 0;

final Path[] filename = new Path[numSpills];

// Each spill produces one file, so the array has numSpills entries.

final TaskAttemptID mapId = getTaskID();

for(int i = 0; i < numSpills; i++) {

// Add up the sizes of all spill files to estimate the size of the merged output

filename[i] = mapOutputFile.getSpillFile(i);

// Look up the path of the spill file by its spill number

finalOutFileSize += rfs.getFileStatus(filename[i]).getLen(); // get the file size

}

if (numSpills == 1) {

// The merged output consists of two files: output/file.out and output/file.out.index

sameVolRename(filename[0],

mapOutputFile.getOutputFileForWriteInVolume(filename[0]));

// Rename the single spill file to file.out on the same volume, so no data has to be copied

if (indexCacheList.size() == 0) {

// The index cache indexCacheList is empty

sameVolRename(mapOutputFile.getSpillIndexFile(0), mapOutputFile.getOutputIndexFileForWriteInVolume(filename[0])); // rename the spill index file

} else {

// The index cache indexCacheList still holds an index record, which has to be written to the index file

indexCacheList.get(0).writeToFile(

// write it to the file

mapOutputFile.getOutputIndexFileForWriteInVolume(filename[0]), job);

}

sortPhase.complete();

return;

// With only one spill file, the merge is already complete.

}

// read in paged indices

for (int i = indexCacheList.size(); i < numSpills; ++i) {

// There is more than one spill file, so they have to be merged

Path indexFileName = mapOutputFile.getSpillIndexFile(i);

indexCacheList.add(new SpillRecord(indexFileName, job));

// First gather all the spill index files together.

}

//make correction in the length to include the sequence file header

//lengths for each partition

finalOutFileSize += partitions * APPROX_HEADER_LENGTH;

// Each partition has a header

finalIndexFileSize = partitions * MAP_OUTPUT_INDEX_RECORD_LENGTH;

// The index file holds one record per partition.

Path finalOutputFile =

mapOutputFile.getOutputFileForWrite(finalOutFileSize);

Path finalIndexFile =

mapOutputFile.getOutputIndexFileForWrite(finalIndexFileSize);

//The output stream for the final single output file

FSDataOutputStream finalOut = rfs.create(finalOutputFile, true, 4096);

// Create the final merged output file.

if (numSpills == 0) {

// If no spill file was produced, still create an empty output file

//create dummy files

IndexRecord rec = new IndexRecord();

// Create the index record

SpillRecord sr = new SpillRecord(partitions);

// Create the spill record

try {

for (int i = 0; i < partitions; i++) {

long segmentStart = finalOut.getPos();

FSDataOutputStream finalPartitionOut = CryptoUtils.wrapIfNecessary(job, finalOut);

Writer<K, V> writer =

new Writer<K, V>(job, finalPartitionOut, keyClass, valClass, codec, null);

writer.close();

// Close the writer right after creating it, which leaves an empty segment for this partition.

rec.startOffset = segmentStart;

rec.rawLength = writer.getRawLength() + CryptoUtils.cryptoPadding(job);

rec.partLength = writer.getCompressedLength() + CryptoUtils.cryptoPadding(job);

sr.putIndex(rec, i);

}

sr.writeToFile(finalIndexFile, job);

// Write all records to the index file

} finally {

finalOut.close();

}

sortPhase.complete();

return;

}

{

sortPhase.addPhases(partitions); // Divide sort phase into sub-phases

IndexRecord rec = new IndexRecord();

final SpillRecord spillRec = new SpillRecord(partitions);

for (int parts = 0; parts < partitions; parts++) {

// finalOut is the final output file. For each partition, collect that partition's data from every spill file, merge it, and write it to finalOut

//create the segments to be merged

List<Segment<K,V>> segmentList =

new ArrayList<Segment<K, V>>(numSpills);

// Create the list of segments (data sections)

for(int i = 0; i < numSpills; i++) {

// Prepare to merge all the spill files

IndexRecord indexRecord = indexCacheList.get(i).getIndex(parts);

Segment<K,V> s =

new Segment<K,V>(job, rfs, filename[i], indexRecord.startOffset,

indexRecord.partLength, codec, true);

segmentList.add(i, s);

// Collect the location of this partition's section in every spill file.

if (LOG.isDebugEnabled()) {

LOG.debug("MapId=" + mapId + " Reducer=" + parts +

"Spill =" + i + "(" + indexRecord.startOffset + "," +

indexRecord.rawLength + ", " + indexRecord.partLength + ")");

}

} // end of the loop over spill files

int mergeFactor = job.getInt(JobContext.IO_SORT_FACTOR, 100);

// Upper bound on the number of streams merged at the same time

boolean sortSegments = segmentList.size() > mergeFactor;

// Sort the segments only when there are more of them than mergeFactor, i.e. when intermediate merges are needed

@SuppressWarnings("unchecked")

RawKeyValueIterator kvIter = Merger.merge(job, rfs,

keyClass, valClass, codec,

segmentList, mergeFactor,

new Path(mapId.toString()),

job.getOutputKeyComparator(), reporter, sortSegments,

null, spilledRecordsCounter, sortPhase.phase(),

TaskType.MAP);

// Merge this partition's content from all spill files, sorting if necessary; the merged result is a single sequence.

//write merged output to disk

long segmentStart = finalOut.getPos();

FSDataOutputStream finalPartitionOut = CryptoUtils.wrapIfNecessary(job, finalOut);

Writer<K, V> writer =

new Writer<K, V>(job, finalPartitionOut, keyClass, valClass, codec,

spilledRecordsCounter);

if (combinerRunner == null || numSpills < minSpillsForCombine) { // minSpillsForCombine is initialized in the MapOutputBuffer constructor; numSpills is the number of spill files this MapTask has written to disk

Merger.writeFile(kvIter, writer, reporter, job);

// Write the merged result straight to the file. The source of writeFile is shown below:

public static <K extends Object, V extends Object>

void writeFile(RawKeyValueIterator records, Writer<K, V> writer,

Progressable progressable, Configuration conf)

throws IOException {

long progressBar = conf.getLong(JobContext.RECORDS_BEFORE_PROGRESS,

10000);

long recordCtr = 0;

while(records.next()) {

writer.append(records.getKey(), records.getValue());

// Append each record to the writer

if (((recordCtr++) % progressBar) == 0) {

progressable.progress();

}

} // end of while

} // end of writeFile

Back to the main code:

} else {

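// A combiner is configured and there are enough spill files

combineCollector.setWriter(writer);

// so insert the combiner step

combinerRunner.combine(kvIter, combineCollector);

// Run the merged result through the combiner before writing it to the file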

}

//close

writer.close(); // close the writer

sortPhase.startNextPhase();

// record offsets

rec.startOffset = segmentStart;

// This partition's record starts at the beginning of the current segment

rec.rawLength = writer.getRawLength() + CryptoUtils.cryptoPadding(job);

rec.partLength = writer.getCompressedLength() + CryptoUtils.cryptoPadding(job);

spillRec.putIndex(rec, parts);

}

spillRec.writeToFile(finalIndexFile, job);

// Write spillRec to the merged index file

finalOut.close();

// Close the final output stream

for(int i = 0; i < numSpills; i++) {

rfs.delete(filename[i],true);

// Delete all spill files

}

} // end of the merge block

} // end of mergeParts

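This method merges all temporary files into one large file saved as output/file.out and generates the matching index file output/file.out.index. During the merge, the Map Task works partition by partition. For each partition it merges in multiple rounds: each round merges io.sort.factor files (the code above falls back to 100 when the property is not set) and puts the resulting file back on the list of files to merge; the list is sorted again and the process repeats until only one file remains. Producing a single file avoids the cost of opening a large number of files at once and of the random reads caused by reading many small files at the same time. Finally, all spill files are deleted.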
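The multi-round merge described above can be sketched as follows. This is a simplified illustration, not Hadoop's Merger: the class and method names are hypothetical, in-memory integer lists stand in for on-disk segments, and the real implementation plans the size of its first pass more carefully than this loop does.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

/** Toy multi-pass merge of sorted runs. Names are hypothetical, not Hadoop code. */
final class MultiPassMergeSketch {

    /** Merges already-sorted runs in one pass, using a min-heap with one cursor per run. */
    static List<Integer> mergeOnce(List<List<Integer>> runs) {
        // Heap entries are {value, run index, offset inside the run}.
        PriorityQueue<int[]> heap =
            new PriorityQueue<>(Comparator.comparingInt((int[] e) -> e[0]));
        for (int r = 0; r < runs.size(); r++) {
            if (!runs.get(r).isEmpty()) {
                heap.add(new int[] {runs.get(r).get(0), r, 0});
            }
        }
        List<Integer> out = new ArrayList<>();
        while (!heap.isEmpty()) {
            int[] top = heap.poll();
            out.add(top[0]);
            List<Integer> run = runs.get(top[1]);
            int next = top[2] + 1;
            if (next < run.size()) {
                heap.add(new int[] {run.get(next), top[1], next});
            }
        }
        return out;
    }

    /** Merges at most `factor` runs per round until a final pass can merge the rest. */
    static List<Integer> mergeAll(List<List<Integer>> spills, int factor) {
        if (factor < 2) {
            throw new IllegalArgumentException("factor must be at least 2");
        }
        List<List<Integer>> pending = new ArrayList<>(spills);
        while (pending.size() > factor) {
            // Sort by size so the smallest runs are merged first.
            pending.sort(Comparator.comparingInt((List<Integer> run) -> run.size()));
            List<List<Integer>> batch = new ArrayList<>(pending.subList(0, factor));
            pending.subList(0, factor).clear();
            pending.add(mergeOnce(batch));  // the intermediate result rejoins the list
        }
        return mergeOnce(pending);
    }

    public static void main(String[] args) {
        List<List<Integer>> spills = Arrays.asList(
            Arrays.asList(1, 4, 7), Arrays.asList(2, 5),
            Arrays.asList(3), Arrays.asList(0, 6, 8, 9));
        System.out.println(mergeAll(spills, 2));
    }
}
```

With a factor of 2 and four runs, the loop performs two intermediate merges before the final pass. With a factor at least as large as the number of runs, everything is merged in a single pass, which corresponds to the case where sortSegments is false in the source above.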

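　　Note that mergeParts() can also run the combiner, but only when two conditions hold: 1. the user configured a combiner; 2. the number of spill files is at least minSpillsForCombine (the code skips the combiner when numSpills < minSpillsForCombine), which corresponds to the configuration property "min.num.spills.for.combine" and defaults to 3. Both conditions must hold for the combiner's local aggregation to run here. The combiner may therefore run twice in the Map phase, so a combiner whose result changes when it is applied a second time can produce output that does not match your expectations.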
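Why a second combiner pass can change the result is easiest to see with a small example. The sketch below is not Hadoop API code; the class and method names are hypothetical. It compares a sum combiner, which is safe to apply repeatedly, with a naive mean combiner, which is not, and shows the usual workaround of carrying (sum, count) pairs.

```java
import java.util.Arrays;
import java.util.List;

/** Toy combiners applied once per spill and once more at merge time. Names are hypothetical. */
final class CombinerSketch {

    /** Sum combiner: combining partial sums again still yields the overall sum. */
    static long sumCombine(List<Long> values) {
        long total = 0;
        for (long v : values) {
            total += v;
        }
        return total;
    }

    /** Naive mean combiner: a mean of means weights the groups incorrectly. */
    static double naiveMeanCombine(List<Double> values) {
        double total = 0;
        for (double v : values) {
            total += v;
        }
        return total / values.size();
    }

    /** Safe alternative for means: combine {sum, count} pairs and divide only at the end. */
    static long[] sumCountCombine(List<long[]> partials) {
        long sum = 0;
        long count = 0;
        for (long[] p : partials) {
            sum += p[0];
            count += p[1];
        }
        return new long[] {sum, count};
    }

    public static void main(String[] args) {
        // Spill A holds the values 1, 2, 3 for some key; spill B holds the value 10.
        long sumTwice = sumCombine(Arrays.asList(
            sumCombine(Arrays.asList(1L, 2L, 3L)), sumCombine(Arrays.asList(10L))));
        double meanTwice = naiveMeanCombine(Arrays.asList(
            naiveMeanCombine(Arrays.asList(1.0, 2.0, 3.0)),
            naiveMeanCombine(Arrays.asList(10.0))));
        long[] merged = sumCountCombine(Arrays.asList(
            sumCountCombine(Arrays.asList(
                new long[] {1, 1}, new long[] {2, 1}, new long[] {3, 1})),
            sumCountCombine(Arrays.asList(new long[] {10, 1}))));
        System.out.println("sum after two combiner passes:  " + sumTwice);
        System.out.println("naive mean after two passes:    " + meanTwice);
        System.out.println("mean from (sum, count) pairs:   " + (double) merged[0] / merged[1]);
    }
}
```

The true mean of 1, 2, 3, 10 is 4, but the naive combiner reports the mean of the two per-spill means instead. A combiner is only safe here if its output can be fed back in as input without changing the final answer.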

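　　That completes the Map phase. In short: records are read and written into the in-memory buffer; when the buffer reaches its threshold, the data is partitioned, sorted with quicksort, and spilled to a local file together with an index file. If a combiner is configured, it aggregates the data before each spill. After all input has been processed, the spill files and their index files are combined with a merge, and if a combiner is configured and the conditions above are met, it aggregates once more during that merge. When the job has reducers, the Map output is stored on local disk, not in HDFS.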

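Once the Mapper has processed all of its input and written the buffer out to spill files, there are only three possibilities: no spill file, one spill file, or several spill files. All three cases need a final output file, regardless of whether it has any content or how much. The final file has the same layout as a single spill file: it is divided into one segment per partition, each holding sorted KV data. Because the spill files are already sorted, merging them amounts to a merge sort. This merge sort only covers the spill files of a single Mapper; later, on the Reducer side, another merge combines the output files of different Mappers.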
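The per-partition layout of the final file and its index can be modeled in a few lines. This is a sketch under stated assumptions, not Hadoop's SpillRecord: the class IndexSketch and its methods are hypothetical, the field names follow the IndexRecord fields used in the source above (startOffset, rawLength, partLength), and the record size of three 8-byte longs is an assumption about MAP_OUTPUT_INDEX_RECORD_LENGTH.

```java
/** Toy model of the index that accompanies the final map output. Names are hypothetical. */
final class IndexSketch {

    /** One entry per partition, mirroring the IndexRecord fields used in mergeParts. */
    static final class Entry {
        final long startOffset;  // where the partition's segment starts in the output file
        final long rawLength;    // uncompressed length of the segment
        final long partLength;   // on-disk (possibly compressed) length of the segment

        Entry(long startOffset, long rawLength, long partLength) {
            this.startOffset = startOffset;
            this.rawLength = rawLength;
            this.partLength = partLength;
        }
    }

    /** Assumed size of one index record: three long fields. */
    static final int RECORD_BYTES = 3 * Long.BYTES;

    /** Lays the segments out back to back, as finalOut.getPos() does in the source. */
    static Entry[] buildIndex(long[] rawLengths, long[] partLengths) {
        if (rawLengths.length != partLengths.length) {
            throw new IllegalArgumentException("one raw and one part length per partition");
        }
        Entry[] index = new Entry[partLengths.length];
        long offset = 0;
        for (int p = 0; p < partLengths.length; p++) {
            index[p] = new Entry(offset, rawLengths[p], partLengths[p]);
            offset += partLengths[p];  // the next segment starts where this one ends
        }
        return index;
    }

    /** Size of the index file, matching partitions * MAP_OUTPUT_INDEX_RECORD_LENGTH. */
    static long indexFileSize(int partitions) {
        return (long) partitions * RECORD_BYTES;
    }

    public static void main(String[] args) {
        Entry[] index = buildIndex(new long[] {100, 0, 50}, new long[] {40, 0, 20});
        for (int p = 0; p < index.length; p++) {
            System.out.println("partition " + p + ": bytes [" + index[p].startOffset
                + ", " + (index[p].startOffset + index[p].partLength) + ")");
        }
        System.out.println("index file size: " + indexFileSize(index.length));
    }
}
```

A reducer responsible for partition p only needs entry p to know which byte range of the output file to fetch. The toy model lets an empty partition occupy zero bytes; in the real file even an empty segment carries a few bytes written by the Writer.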

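When the MapTask finishes and runNewMapper returns, the next step is done, the MapTask's wrap-up work. This wrap-up concerns how the generated output is handed over to the ReduceTask. MapTask and ReduceTask both extend Task, and neither defines its own done method, so both call Task.done.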

This is where the program leaves runNewMapper:

if (useNewApi) {

runNewMapper(job, splitMetaInfo, umbilical, reporter);

} else {

runOldMapper(job, splitMetaInfo, umbilical, reporter);

}

done(umbilical, reporter);

Stepping into done shows that it is Task.done. The source is as follows:

public void done(TaskUmbilicalProtocol umbilical,

TaskReporter reporter

) throws IOException, InterruptedException {

LOG.info("Task:" + taskId + " is done."

+ " And is in the process of committing");

updateCounters();

// Update the counters

boolean commitRequired = isCommitRequired();

if (commitRequired) {

int retries = MAX_RETRIES;

setState(TaskStatus.State.COMMIT_PENDING);

// say the task tracker that task is commit pending

while (true) {

try {

umbilical.commitPending(taskId, taskStatus);

break;

// If commitPending did not throw, leave the loop; otherwise retry.

} catch (InterruptedException ie) {

// ignore

} catch (IOException ie) {

LOG.warn("Failure sending commit pending: " +

StringUtils.stringifyException(ie));

if (--retries == 0) {

System.exit(67);

}

} // end of catch

} // end of while (true)

//wait for commit approval and commit

commit(umbilical, reporter, committer);

}

taskDone.set(true);

reporter.stopCommunicationThread();

// Make sure we send at least one set of counter increments. It's

// ok to call updateCounters() in this thread after comm thread stopped.

updateCounters();

sendLastUpdate(umbilical);

//signal the tasktracker that we are done

sendDone(umbilical);

} // end of done

The source code of sendDone:

private void sendDone(TaskUmbilicalProtocol umbilical) throws IOException {

int retries = MAX_RETRIES;

while (true) {

try {

umbilical.done(getTaskID());

// This actually sends a TA_DONE event to the TaskAttemptImpl on the MRAppMaster

LOG.info("Task '" + taskId + "' done.");

return;

} catch (IOException ie) {

LOG.warn("Failure signalling completion: " +

StringUtils.stringifyException(ie));

if (--retries == 0) {

throw ie;

}

} // end of catch

} // end of while (true)

} // end of sendDone

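umbilical.done(getTaskID());

// This call actually sends a TA_DONE event to the TaskAttemptImpl on the MRAppMaster. Driven by TA_DONE, the state machine of the corresponding TaskAttemptImpl object executes CleanupContainerTransition.transition and then enters the SUCCESS_CONTAINER_CLEANUP state. Note that the TaskAttemptEventType.TA_DONE event is issued from the node where the MapTask runs, but the state transition it triggers takes place on the MRAppMaster node. For a MapTask, the umbilical stands for the MRAppMaster.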

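When the MRAppMaster receives the CONTAINER_REMOTE_CLEANUP event, the ContainerLauncher calls ContainerManagerImpl.stopContainers on the MapTask's node via RPC, which puts the MapTask's container into the KILLED_BY_APPMASTER state so that it is no longer active. When that succeeds, a TA_CONTAINER_CLEANED event is sent to the corresponding TaskAttemptImpl. A successful TaskAttempt means the task it was attempting has succeeded too, so the TaskAttempt's state feeds into the TaskImpl object. TaskImpl then does its checks and wrap-up, which includes sending a TaskState.SUCCEEDED event up to the JobImpl object. The SUCCEEDED event sent to the TaskImpl itself leads to TaskImpl.handleTaskAttemptCompletion.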

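On the Mapper node, a call to setMapOutputServerAdress sets the node's MapOutputServer address to a web address, which means the output left behind by the MapTask (the merged spill file) can be fetched over an HTTP connection. This completes the whole Mapper process.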