Shuffle过程主要分为Shuffle write和Shuffle read两个阶段，2.0版本之后hash shuffle被删除，只保留sort shuffle，下面结合代码分析：

1.ShuffleManager

Spark在初始化SparkEnv的时候，会在create()方法里面初始化ShuffleManager

// Let the user specify short names for shuffle managers

    val shortShuffleMgrNames = Map(

      "sort" -> classOf[org.apache.spark.shuffle.sort.SortShuffleManager].getName,

      "tungsten-sort" -> classOf[org.apache.spark.shuffle.sort.SortShuffleManager].getName)

    val shuffleMgrName = conf.get(config.SHUFFLE_MANAGER)

    val shuffleMgrClass =

      shortShuffleMgrNames.getOrElse(shuffleMgrName.toLowerCase(Locale.ROOT), shuffleMgrName)

    val shuffleManager = instantiateClass[ShuffleManager](shuffleMgrClass)

这里可以看到包含sort和tungsten-sort两种shuffle，通过反射创建了ShuffleManager，ShuffleManager是一个特质，核心方法有下面几个：

private[spark] trait ShuffleManager {

  /**

   * 注册一个shuffle返回句柄

   */

  def registerShuffle[K, V, C](

      shuffleId: Int,

      dependency: ShuffleDependency[K, V, C]): ShuffleHandle

  /** 获取一个Writer根据给定的分区，在executors执行map任务时被调用 */

  def getWriter[K, V](

      handle: ShuffleHandle,

      mapId: Long,

      context: TaskContext,

      metrics: ShuffleWriteMetricsReporter): ShuffleWriter[K, V]

  /**

   * 获取一个Reader根据reduce分区的范围，在executors执行reduce任务时被调用

   */

  def getReader[K, C](

      handle: ShuffleHandle,

      startPartition: Int,

      endPartition: Int,

      context: TaskContext,

      metrics: ShuffleReadMetricsReporter): ShuffleReader[K, C]

	...

}

2.SortShuffleManager

SortShuffleManager是ShuffleManager的唯一实现类，对于以上三个方法的实现如下：

2.1 registerShuffle

/**

   * Obtains a [[ShuffleHandle]] to pass to tasks.

   */

  override def registerShuffle[K, V, C](

      shuffleId: Int,

      dependency: ShuffleDependency[K, V, C]): ShuffleHandle = {

    // 1.首先检查是否符合BypassMergeSort

    if (SortShuffleWriter.shouldBypassMergeSort(conf, dependency)) {

      // If there are fewer than spark.shuffle.sort.bypassMergeThreshold partitions and we don't

      // need map-side aggregation, then write numPartitions files directly and just concatenate

      // them at the end. This avoids doing serialization and deserialization twice to merge

      // together the spilled files, which would happen with the normal code path. The downside is

      // having multiple files open at a time and thus more memory allocated to buffers.

      new BypassMergeSortShuffleHandle[K, V](

        shuffleId, dependency.asInstanceOf[ShuffleDependency[K, V, V]])

      // 2.否则检查是否能够序列化

    } else if (SortShuffleManager.canUseSerializedShuffle(dependency)) {

      // Otherwise, try to buffer map outputs in a serialized form, since this is more efficient:

      new SerializedShuffleHandle[K, V](

        shuffleId, dependency.asInstanceOf[ShuffleDependency[K, V, V]])

    } else {

      // Otherwise, buffer map outputs in a deserialized form:

      new BaseShuffleHandle(shuffleId, dependency)

    }

  }

1.首先检查是否符合BypassMergeSort，这里需要满足两个条件，首先是当前shuffle依赖中没有map端的聚合操作，其次是分区数要小于spark.shuffle.sort.bypassMergeThreshold的值，默认为200，如果满足这两个条件，会返回BypassMergeSortShuffleHandle，启用bypass merge-sort shuffle机制

def shouldBypassMergeSort(conf: SparkConf, dep: ShuffleDependency[_, _, _]): Boolean = {

  // We cannot bypass sorting if we need to do map-side aggregation.

  if (dep.mapSideCombine) {

    false

  } else {

    // 默认值为200

    val bypassMergeThreshold: Int = conf.get(config.SHUFFLE_SORT_BYPASS_MERGE_THRESHOLD)

    dep.partitioner.numPartitions <= bypassMergeThreshold

  }

}

2.如果不满足上面条件，检查是否满足canUseSerializedShuffle()方法，如果满足该方法中的3个条件，则会返回SerializedShuffleHandle，启用tungsten-sort shuffle机制

def canUseSerializedShuffle(dependency: ShuffleDependency[_, _, _]): Boolean = {

  val shufId = dependency.shuffleId

  val numPartitions = dependency.partitioner.numPartitions

  // 序列化器需要支持Relocation

  if (!dependency.serializer.supportsRelocationOfSerializedObjects) {

    log.debug(s"Can't use serialized shuffle for shuffle $shufId because the serializer, " +

      s"${dependency.serializer.getClass.getName}, does not support object relocation")

    false

    // 不能有map端聚合操作

  } else if (dependency.mapSideCombine) {

    log.debug(s"Can't use serialized shuffle for shuffle $shufId because we need to do " +

      s"map-side aggregation")

    false

    // 分区数不能大于16777215+1

  } else if (numPartitions > MAX_SHUFFLE_OUTPUT_PARTITIONS_FOR_SERIALIZED_MODE) {

    log.debug(s"Can't use serialized shuffle for shuffle $shufId because it has more than " +

      s"$MAX_SHUFFLE_OUTPUT_PARTITIONS_FOR_SERIALIZED_MODE partitions")

    false

  } else {

    log.debug(s"Can use serialized shuffle for shuffle $shufId")

    true

  }

}

3.如果以上两个条件都不满足的话，会返回BaseShuffleHandle，采用基本sort shuffle机制

2.2 getReader

/**

 * Get a reader for a range of reduce partitions (startPartition to endPartition-1, inclusive).

 * Called on executors by reduce tasks.

 */

override def getReader[K, C](

    handle: ShuffleHandle,

    startPartition: Int,

    endPartition: Int,

    context: TaskContext,

    metrics: ShuffleReadMetricsReporter): ShuffleReader[K, C] = {

  val blocksByAddress = SparkEnv.get.mapOutputTracker.getMapSizesByExecutorId(

    handle.shuffleId, startPartition, endPartition)

  new BlockStoreShuffleReader(

    handle.asInstanceOf[BaseShuffleHandle[K, _, C]], blocksByAddress, context, metrics,

    shouldBatchFetch = canUseBatchFetch(startPartition, endPartition, context))

}

这里返回BlockStoreShuffleReader

2.3 getWriter

/** Get a writer for a given partition. Called on executors by map tasks. */

override def getWriter[K, V](

    handle: ShuffleHandle,

    mapId: Long,

    context: TaskContext,

    metrics: ShuffleWriteMetricsReporter): ShuffleWriter[K, V] = {

  val mapTaskIds = taskIdMapsForShuffle.computeIfAbsent(

    handle.shuffleId, _ => new OpenHashSet[Long](16))

  mapTaskIds.synchronized { mapTaskIds.add(context.taskAttemptId()) }

  val env = SparkEnv.get

  // 根据handle获取不同ShuffleWrite

  handle match {

    case unsafeShuffleHandle: SerializedShuffleHandle[K @unchecked, V @unchecked] =>

      new UnsafeShuffleWriter(

        env.blockManager,

        context.taskMemoryManager(),

        unsafeShuffleHandle,

        mapId,

        context,

        env.conf,

        metrics,

        shuffleExecutorComponents)

    case bypassMergeSortHandle: BypassMergeSortShuffleHandle[K @unchecked, V @unchecked] =>

      new BypassMergeSortShuffleWriter(

        env.blockManager,

        bypassMergeSortHandle,

        mapId,

        env.conf,

        metrics,

        shuffleExecutorComponents)

    case other: BaseShuffleHandle[K @unchecked, V @unchecked, _] =>

      new SortShuffleWriter(

        shuffleBlockResolver, other, mapId, context, shuffleExecutorComponents)

  }

}

这里会根据handle获取不同ShuffleWrite，如果是SerializedShuffleHandle，使用UnsafeShuffleWriter，如果是BypassMergeSortShuffleHandle，采用BypassMergeSortShuffleWriter，否则使用SortShuffleWriter

3.三种Writer的实现

如上文所说，当开启bypass机制后，会使用BypassMergeSortShuffleWriter，如果serializer支持relocation并且map端没有聚合同时分区数目不大于16777215+1三个条件都满足，使用UnsafeShuffleWriter，否则使用SortShuffleWriter

3.1 BypassMergeSortShuffleWriter

BypassMergeSortShuffleWriter继承ShuffleWriter，用java实现，会将map端的多个输出文件合并为一个文件，同时生成一个索引文件，索引记录到每个分区的初始地址，write()方法如下：

@Override

public void write(Iterator<Product2<K, V>> records) throws IOException {

  assert (partitionWriters == null);

  // 新建一个ShuffleMapOutputWriter

  ShuffleMapOutputWriter mapOutputWriter = shuffleExecutorComponents

  .createMapOutputWriter(shuffleId, mapId, numPartitions);

  try {

    // 如果没有数据的话

    if (!records.hasNext()) {

      // 返回所有分区的写入长度

      partitionLengths = mapOutputWriter.commitAllPartitions();

      // 更新mapStatus

      mapStatus = MapStatus$.MODULE$.apply(

        blockManager.shuffleServerId(), partitionLengths, mapId);

      return;

    }

    final SerializerInstance serInstance = serializer.newInstance();

    final long openStartTime = System.nanoTime();

    // 创建和分区数相等的DiskBlockObjectWriter FileSegment

    partitionWriters = new DiskBlockObjectWriter[numPartitions];

    partitionWriterSegments = new FileSegment[numPartitions];

    // 对于每个分区

    for (int i = 0; i < numPartitions; i++) {

      // 创建一个临时的block

      final Tuple2<TempShuffleBlockId, File> tempShuffleBlockIdPlusFile =

      blockManager.diskBlockManager().createTempShuffleBlock();

      // 获取temp block的file和id

      final File file = tempShuffleBlockIdPlusFile._2();

      final BlockId blockId = tempShuffleBlockIdPlusFile._1();

      // 对于每个分区，创建一个DiskBlockObjectWriter

      partitionWriters[i] =

      blockManager.getDiskWriter(blockId, file, serInstance, fileBufferSize, writeMetrics);

    }

    // Creating the file to write to and creating a disk writer both involve interacting with

    // the disk, and can take a long time in aggregate when we open many files, so should be

    // included in the shuffle write time.

    // 创建文件和写入文件都需要大量时间，也需要包含在shuffle写入时间里面

    writeMetrics.incWriteTime(System.nanoTime() - openStartTime);

    // 如果有数据的话

    while (records.hasNext()) {

      final Product2<K, V> record = records.next();

      final K key = record._1();

      // 对于每条数据按key写入相应分区对应的文件

      partitionWriters[partitioner.getPartition(key)].write(key, record._2());

    }

    for (int i = 0; i < numPartitions; i++) {

      try (DiskBlockObjectWriter writer = partitionWriters[i]) {

        // 提交

        partitionWriterSegments[i] = writer.commitAndGet();

      }

    }

    // 将所有分区文件合并成一个文件

    partitionLengths = writePartitionedData(mapOutputWriter);

    // 更新mapStatus

    mapStatus = MapStatus$.MODULE$.apply(

      blockManager.shuffleServerId(), partitionLengths, mapId);

  } catch (Exception e) {

    try {

      mapOutputWriter.abort(e);

    } catch (Exception e2) {

      logger.error("Failed to abort the writer after failing to write map output.", e2);

      e.addSuppressed(e2);

    }

    throw e;

  }

}

合并文件的方法writePartitionedData()如下，默认采用零拷贝的方式来合并文件：

private long[] writePartitionedData(ShuffleMapOutputWriter mapOutputWriter) throws IOException {

  // Track location of the partition starts in the output file

  if (partitionWriters != null) {

    // 开始时间

    final long writeStartTime = System.nanoTime();

    try {

      for (int i = 0; i < numPartitions; i++) {

        // 获取每个文件

        final File file = partitionWriterSegments[i].file();

        ShufflePartitionWriter writer = mapOutputWriter.getPartitionWriter(i);

        if (file.exists()) {

          // 采取零拷贝方式

          if (transferToEnabled) {

            // Using WritableByteChannelWrapper to make resource closing consistent between

            // this implementation and UnsafeShuffleWriter.

            Optional<WritableByteChannelWrapper> maybeOutputChannel = writer.openChannelWrapper();

            // 在这里会调用Utils.copyFileStreamNIO方法，最终调用FileChannel.transferTo方法拷贝文件

            if (maybeOutputChannel.isPresent()) {

              writePartitionedDataWithChannel(file, maybeOutputChannel.get());

            } else {

              writePartitionedDataWithStream(file, writer);

            }

          } else {

            // 否则采取流的方式拷贝

            writePartitionedDataWithStream(file, writer);

          }

          if (!file.delete()) {

            logger.error("Unable to delete file for partition {}", i);

          }

        }

      }

    } finally {

      writeMetrics.incWriteTime(System.nanoTime() - writeStartTime);

    }

    partitionWriters = null;

  }

  return mapOutputWriter.commitAllPartitions();

}

3.2 UnsafeShuffleWriter

UnsafeShuffleWriter也是继承ShuffleWriter，用java实现，write方法如下：

@Override

public void write(scala.collection.Iterator<Product2<K, V>> records) throws IOException {

  // Keep track of success so we know if we encountered an exception

  // We do this rather than a standard try/catch/re-throw to handle

  // generic throwables.

  // 跟踪异常

  boolean success = false;

  try {

    while (records.hasNext()) {

      // 将数据插入ShuffleExternalSorter进行外部排序

      insertRecordIntoSorter(records.next());

    }

    // 合并并输出文件

    closeAndWriteOutput();

    success = true;

  } finally {

    if (sorter != null) {

      try {

        sorter.cleanupResources();

      } catch (Exception e) {

        // Only throw this error if we won't be masking another

        // error.

        if (success) {

          throw e;

        } else {

          logger.error("In addition to a failure during writing, we failed during " +

                       "cleanup.", e);

        }

      }

    }

  }

}

这里主要有两个方法：

3.2.1 insertRecordIntoSorter()

@VisibleForTesting

void insertRecordIntoSorter(Product2<K, V> record) throws IOException {

  assert(sorter != null);

  // 获取key和分区

  final K key = record._1();

  final int partitionId = partitioner.getPartition(key);

  // 重置缓冲区

  serBuffer.reset();

  // 将key和value写入缓冲区

  serOutputStream.writeKey(key, OBJECT_CLASS_TAG);

  serOutputStream.writeValue(record._2(), OBJECT_CLASS_TAG);

  serOutputStream.flush();

  // 获取序列化数据大小

  final int serializedRecordSize = serBuffer.size();

  assert (serializedRecordSize > 0);

  // 将序列化后的数据插入ShuffleExternalSorter处理

  sorter.insertRecord(

    serBuffer.getBuf(), Platform.BYTE_ARRAY_OFFSET, serializedRecordSize, partitionId);

}

该方法会将数据进行序列化，并且将序列化后的数据通过insertRecord()方法插入外部排序器中，insertRecord()方法如下：

public void insertRecord(Object recordBase, long recordOffset, int length, int partitionId)

  throws IOException {

  // for tests

  assert(inMemSorter != null);

  // 如果数据条数超过溢写阈值，直接溢写磁盘

  if (inMemSorter.numRecords() >= numElementsForSpillThreshold) {

    logger.info("Spilling data because number of spilledRecords crossed the threshold " +

      numElementsForSpillThreshold);

    spill();

  }

  // Checks whether there is enough space to insert an additional record in to the sort pointer

  // array and grows the array if additional space is required. If the required space cannot be

  // obtained, then the in-memory data will be spilled to disk.

  // 检查是否有足够的空间插入额外的记录到排序指针数组中，如果需要额外的空间对数组进行扩容，如果空间不够，内存中的数据将会被溢写到磁盘上

  growPointerArrayIfNecessary();

  final int uaoSize = UnsafeAlignedOffset.getUaoSize();

  // Need 4 or 8 bytes to store the record length.

  // 需要额外的4或8个字节存储数据长度

  final int required = length + uaoSize;

  // 如果需要更多的内存，会想TaskMemoryManager申请新的page

  acquireNewPageIfNecessary(required);

  assert(currentPage != null);

  final Object base = currentPage.getBaseObject();

  //Given a memory page and offset within that page, encode this address into a 64-bit long.

  //This address will remain valid as long as the corresponding page has not been freed.

  // 通过给定的内存页和偏移量，将当前数据的逻辑地址编码成一个long型

  final long recordAddress = taskMemoryManager.encodePageNumberAndOffset(currentPage, pageCursor);

  // 写长度值

  UnsafeAlignedOffset.putSize(base, pageCursor, length);

  // 移动指针

  pageCursor += uaoSize;

  // 写数据

  Platform.copyMemory(recordBase, recordOffset, base, pageCursor, length);

  // 移动指针

  pageCursor += length;

  // 将编码的逻辑地址和分区id传给ShuffleInMemorySorter进行排序

  inMemSorter.insertRecord(recordAddress, partitionId);

}

在这里对于数据的缓存和溢写不借助于其他高级数据结构，而是直接操作内存空间

growPointerArrayIfNecessary()方法如下：

/**

 * Checks whether there is enough space to insert an additional record in to the sort pointer

 * array and grows the array if additional space is required. If the required space cannot be

 * obtained, then the in-memory data will be spilled to disk.

 */

private void growPointerArrayIfNecessary() throws IOException {

  assert(inMemSorter != null);

  // 如果没有空间容纳新的数据

  if (!inMemSorter.hasSpaceForAnotherRecord()) {

    // 获取当前内存使用量

    long used = inMemSorter.getMemoryUsage();

    LongArray array;

    try {

      // could trigger spilling

      // 分配给缓存原来两倍的容量

      array = allocateArray(used / 8 * 2);

    } catch (TooLargePageException e) {

      // The pointer array is too big to fix in a single page, spill.

      // 如果超出了一页的大小，直接溢写，溢写方法见后面

      // 一页的大小为128M，在PackedRecordPointer类中

      // static final int MAXIMUM_PAGE_SIZE_BYTES = 1 << 27;  // 128 megabytes

      spill();

      return;

    } catch (SparkOutOfMemoryError e) {

      // should have trigger spilling

      if (!inMemSorter.hasSpaceForAnotherRecord()) {

        logger.error("Unable to grow the pointer array");

        throw e;

      }

      return;

    }

    // check if spilling is triggered or not

    if (inMemSorter.hasSpaceForAnotherRecord()) {

      // 如果有了剩余空间，则表明没必要扩容，释放分配的空间

      freeArray(array);

    } else {

      // 否则把原来的数组复制到新的数组

      inMemSorter.expandPointerArray(array);

    }

  }

}

spill()方法如下：

@Override

public long spill(long size, MemoryConsumer trigger) throws IOException {

  if (trigger != this || inMemSorter == null || inMemSorter.numRecords() == 0) {

    return 0L;

  }

  logger.info("Thread {} spilling sort data of {} to disk ({} {} so far)",

    Thread.currentThread().getId(),

    Utils.bytesToString(getMemoryUsage()),

    spills.size(),

    spills.size() > 1 ? " times" : " time");

  // Sorts the in-memory records and writes the sorted records to an on-disk file.

  // This method does not free the sort data structures.

  // 对内存中的数据进行排序并且将有序记录写到一个磁盘文件中，这个方法不会释放排序的数据结构

  writeSortedFile(false);

  final long spillSize = freeMemory();

  // 重置ShuffleInMemorySorter

  inMemSorter.reset();

  // Reset the in-memory sorter's pointer array only after freeing up the memory pages holding the

  // records. Otherwise, if the task is over allocated memory, then without freeing the memory

  // pages, we might not be able to get memory for the pointer array.

  taskContext.taskMetrics().incMemoryBytesSpilled(spillSize);

  return spillSize;

}

writeSortedFile()方法：

private void writeSortedFile(boolean isLastFile) {

  // This call performs the actual sort.

  // 返回一个排序好的迭代器

  final ShuffleInMemorySorter.ShuffleSorterIterator sortedRecords =

    inMemSorter.getSortedIterator();

  // If there are no sorted records, so we don't need to create an empty spill file.

  if (!sortedRecords.hasNext()) {

    return;

  }

  final ShuffleWriteMetricsReporter writeMetricsToUse;

  // 如果为true，则为输出文件，否则为溢写文件

  if (isLastFile) {

    // We're writing the final non-spill file, so we _do_ want to count this as shuffle bytes.

    writeMetricsToUse = writeMetrics;

  } else {

    // We're spilling, so bytes written should be counted towards spill rather than write.

    // Create a dummy WriteMetrics object to absorb these metrics, since we don't want to count

    // them towards shuffle bytes written.

    writeMetricsToUse = new ShuffleWriteMetrics();

  }

  // Small writes to DiskBlockObjectWriter will be fairly inefficient. Since there doesn't seem to

  // be an API to directly transfer bytes from managed memory to the disk writer, we buffer

  // data through a byte array. This array does not need to be large enough to hold a single

  // record;

  // 创建一个字节缓冲数组，大小为1m

  final byte[] writeBuffer = new byte[diskWriteBufferSize];

  // Because this output will be read during shuffle, its compression codec must be controlled by

  // spark.shuffle.compress instead of spark.shuffle.spill.compress, so we need to use

  // createTempShuffleBlock here; see SPARK-3426 for more details.

  // 创建一个临时的shuffle block

  final Tuple2<TempShuffleBlockId, File> spilledFileInfo =

    blockManager.diskBlockManager().createTempShuffleBlock();

  // 获取文件和id

  final File file = spilledFileInfo._2();

  final TempShuffleBlockId blockId = spilledFileInfo._1();

  final SpillInfo spillInfo = new SpillInfo(numPartitions, file, blockId);

  // Unfortunately, we need a serializer instance in order to construct a DiskBlockObjectWriter.

  // Our write path doesn't actually use this serializer (since we end up calling the `write()`

  // OutputStream methods), but DiskBlockObjectWriter still calls some methods on it. To work

  // around this, we pass a dummy no-op serializer.

  // 不做任何转换的序列化器，因为需要一个实例来构造DiskBlockObjectWriter

  final SerializerInstance ser = DummySerializerInstance.INSTANCE;

  int currentPartition = -1;

  final FileSegment committedSegment;

  try (DiskBlockObjectWriter writer =

      blockManager.getDiskWriter(blockId, file, ser, fileBufferSizeBytes, writeMetricsToUse)) {

    final int uaoSize = UnsafeAlignedOffset.getUaoSize();

    // 遍历

    while (sortedRecords.hasNext()) {

      sortedRecords.loadNext();

      final int partition = sortedRecords.packedRecordPointer.getPartitionId();

      assert (partition >= currentPartition);

      if (partition != currentPartition) {

        // Switch to the new partition

        // 如果切换到了新的分区，提交当前分区，并且记录当前分区大小

        if (currentPartition != -1) {

          final FileSegment fileSegment = writer.commitAndGet();

          spillInfo.partitionLengths[currentPartition] = fileSegment.length();

        }

        // 然后切换到下一个分区

        currentPartition = partition;

      }

      // 获取指针，通过指针获取页号和偏移量

      final long recordPointer = sortedRecords.packedRecordPointer.getRecordPointer();

      final Object recordPage = taskMemoryManager.getPage(recordPointer);

      final long recordOffsetInPage = taskMemoryManager.getOffsetInPage(recordPointer);

      // 获取剩余数据

      int dataRemaining = UnsafeAlignedOffset.getSize(recordPage, recordOffsetInPage);

      // 跳过数据前面存储的长度

      long recordReadPosition = recordOffsetInPage + uaoSize; // skip over record length

      while (dataRemaining > 0) {

        final int toTransfer = Math.min(diskWriteBufferSize, dataRemaining);

        // 将数据拷贝到缓冲数组中

        Platform.copyMemory(

          recordPage, recordReadPosition, writeBuffer, Platform.BYTE_ARRAY_OFFSET, toTransfer);

        // 从缓冲数组中转入DiskBlockObjectWriter

        writer.write(writeBuffer, 0, toTransfer);

        // 更新位置

        recordReadPosition += toTransfer;

        // 更新剩余数据

        dataRemaining -= toTransfer;

      }

      writer.recordWritten();

    }

    // 提交

    committedSegment = writer.commitAndGet();

  }

  // If `writeSortedFile()` was called from `closeAndGetSpills()` and no records were inserted,

  // then the file might be empty. Note that it might be better to avoid calling

  // writeSortedFile() in that case.

  // 记录溢写文件的列表

  if (currentPartition != -1) {

    spillInfo.partitionLengths[currentPartition] = committedSegment.length();

    spills.add(spillInfo);

  }

  // 如果是溢写文件，更新溢写的指标

  if (!isLastFile) {

    writeMetrics.incRecordsWritten(

      ((ShuffleWriteMetrics)writeMetricsToUse).recordsWritten());

    taskContext.taskMetrics().incDiskBytesSpilled(

      ((ShuffleWriteMetrics)writeMetricsToUse).bytesWritten());

  }

}

encodePageNumberAndOffset()方法如下：

public long encodePageNumberAndOffset(MemoryBlock page, long offsetInPage) {

  // 如果开启了堆外内存，偏移量为绝对地址，可能需要64位进行编码，由于页大小限制，将其减去当前页的基地址，变为相对地址

  if (tungstenMemoryMode == MemoryMode.OFF_HEAP) {

    // In off-heap mode, an offset is an absolute address that may require a full 64 bits to

    // encode. Due to our page size limitation, though, we can convert this into an offset that's

    // relative to the page's base offset; this relative offset will fit in 51 bits.

    offsetInPage -= page.getBaseOffset();

  }

  return encodePageNumberAndOffset(page.pageNumber, offsetInPage);

}

@VisibleForTesting

public static long encodePageNumberAndOffset(int pageNumber, long offsetInPage) {

  assert (pageNumber >= 0) : "encodePageNumberAndOffset called with invalid page";

  // 高13位为页号，低51位为偏移量

  // 页号左移51位，再拼偏移量和上一个低51位都为1的掩码0x7FFFFFFFFFFFFL

  return (((long) pageNumber) << OFFSET_BITS) | (offsetInPage & MASK_LONG_LOWER_51_BITS);

}

ShuffleInMemorySorter的insertRecord()方法如下：

public void insertRecord(long recordPointer, int partitionId) {

  if (!hasSpaceForAnotherRecord()) {

    throw new IllegalStateException("There is no space for new record");

  }

  array.set(pos, PackedRecordPointer.packPointer(recordPointer, partitionId));

  pos++;

}

PackedRecordPointer.packPointer()方法：

public static long packPointer(long recordPointer, int partitionId) {

  assert (partitionId <= MAXIMUM_PARTITION_ID);

  // Note that without word alignment we can address 2^27 bytes = 128 megabytes per page.

  // Also note that this relies on some internals of how TaskMemoryManager encodes its addresses.

  // 将页号右移24位，和低27位拼在一起，这样逻辑地址被压缩成40位

  final long pageNumber = (recordPointer & MASK_LONG_UPPER_13_BITS) >>> 24;

  final long compressedAddress = pageNumber | (recordPointer & MASK_LONG_LOWER_27_BITS);

  // 将分区号放在高24位上

  return (((long) partitionId) << 40) | compressedAddress;

}

getSortedIterator()方法：

public ShuffleSorterIterator getSortedIterator() {

  int offset = 0;

  // 使用基数排序对内存分区ID进行排序。基数排序要快得多，但是在添加指针时需要额外的内存作为保留内存

  if (useRadixSort) {

    offset = RadixSort.sort(

      array, pos,

      PackedRecordPointer.PARTITION_ID_START_BYTE_INDEX,

      PackedRecordPointer.PARTITION_ID_END_BYTE_INDEX, false, false);

    // 否则采用timSort排序

  } else {

    MemoryBlock unused = new MemoryBlock(

      array.getBaseObject(),

      array.getBaseOffset() + pos * 8L,

      (array.size() - pos) * 8L);

    LongArray buffer = new LongArray(unused);

    Sorter<PackedRecordPointer, LongArray> sorter =

      new Sorter<>(new ShuffleSortDataFormat(buffer));

    sorter.sort(array, 0, pos, SORT_COMPARATOR);

  }

  return new ShuffleSorterIterator(pos, array, offset);

}

3.2.2 closeAndWriteOutput()

@VisibleForTesting

void closeAndWriteOutput() throws IOException {

  assert(sorter != null);

  updatePeakMemoryUsed();

  serBuffer = null;

  serOutputStream = null;

  // 获取溢写文件

  final SpillInfo[] spills = sorter.closeAndGetSpills();

  sorter = null;

  final long[] partitionLengths;

  try {

    // 合并溢写文件

    partitionLengths = mergeSpills(spills);

  } finally {

    // 删除溢写文件

    for (SpillInfo spill : spills) {

      if (spill.file.exists() && !spill.file.delete()) {

        logger.error("Error while deleting spill file {}", spill.file.getPath());

      }

    }

  }

  // 更新mapstatus

  mapStatus = MapStatus$.MODULE$.apply(

    blockManager.shuffleServerId(), partitionLengths, mapId);

}

mergeSpills()方法：

private long[] mergeSpills(SpillInfo[] spills) throws IOException {

  long[] partitionLengths;

  // 如果没有溢写文件，创建空的

  if (spills.length == 0) {

    final ShuffleMapOutputWriter mapWriter = shuffleExecutorComponents

        .createMapOutputWriter(shuffleId, mapId, partitioner.numPartitions());

    return mapWriter.commitAllPartitions();

    // 如果只有一个溢写文件，将它合并输出

  } else if (spills.length == 1) {

    Optional<SingleSpillShuffleMapOutputWriter> maybeSingleFileWriter =

        shuffleExecutorComponents.createSingleFileMapOutputWriter(shuffleId, mapId);

    if (maybeSingleFileWriter.isPresent()) {

      // Here, we don't need to perform any metrics updates because the bytes written to this

      // output file would have already been counted as shuffle bytes written.

      partitionLengths = spills[0].partitionLengths;

      maybeSingleFileWriter.get().transferMapSpillFile(spills[0].file, partitionLengths);

    } else {

      partitionLengths = mergeSpillsUsingStandardWriter(spills);

    }

    // 如果有多个，合并输出，合并的时候有NIO和BIO两种方式

  } else {

    partitionLengths = mergeSpillsUsingStandardWriter(spills);

  }

  return partitionLengths;

}

3.3 SortShuffleWriter

SortShuffleWriter会使用PartitionedAppendOnlyMap或PartitionedPariBuffer在内存中进行排序，如果超过内存限制，会溢写到文件中，在全局输出有序文件的时候，对之前的所有输出文件和当前内存中的数据进行全局归并排序，对key相同的元素会使用定义的function进行聚合，入口为write()方法：

override def write(records: Iterator[Product2[K, V]]): Unit = {

  // 创建一个外部排序器，如果map端有预聚合，就传入aggregator和keyOrdering，否则不需要传入

  sorter = if (dep.mapSideCombine) {

    new ExternalSorter[K, V, C](

      context, dep.aggregator, Some(dep.partitioner), dep.keyOrdering, dep.serializer)

  } else {

    // In this case we pass neither an aggregator nor an ordering to the sorter, because we don't

    // care whether the keys get sorted in each partition; that will be done on the reduce side

    // if the operation being run is sortByKey.

    new ExternalSorter[K, V, V](

      context, aggregator = None, Some(dep.partitioner), ordering = None, dep.serializer)

  }

  // 将数据放入ExternalSorter进行排序

  sorter.insertAll(records)

  // Don't bother including the time to open the merged output file in the shuffle write time,

  // because it just opens a single file, so is typically too fast to measure accurately

  // (see SPARK-3570).

  // 创建一个输出Wrtier

  val mapOutputWriter = shuffleExecutorComponents.createMapOutputWriter(

    dep.shuffleId, mapId, dep.partitioner.numPartitions)

  // 将外部排序的数据写入Writer

  sorter.writePartitionedMapOutput(dep.shuffleId, mapId, mapOutputWriter)

  val partitionLengths = mapOutputWriter.commitAllPartitions()

  // 更新mapstatus

  mapStatus = MapStatus(blockManager.shuffleServerId, partitionLengths, mapId)

}

insertAll()方法：

def insertAll(records: Iterator[Product2[K, V]]): Unit = {

  // TODO: stop combining if we find that the reduction factor isn't high

  val shouldCombine = aggregator.isDefined

  // 是否需要map端聚合

  if (shouldCombine) {

    // Combine values in-memory first using our AppendOnlyMap

    // 使用AppendOnlyMap在内存中聚合values

    // 获取mergeValue()函数，将新值合并到当前聚合结果中

    val mergeValue = aggregator.get.mergeValue

    // 获取createCombiner()函数，创建聚合初始值

    val createCombiner = aggregator.get.createCombiner

    var kv: Product2[K, V] = null

    // 如果一个key当前有聚合值，则合并，如果没有创建初始值

    val update = (hadValue: Boolean, oldValue: C) => {

      if (hadValue) mergeValue(oldValue, kv._2) else createCombiner(kv._2)

    }

    // 遍历

    while (records.hasNext) {

      // 增加读取记录数

      addElementsRead()

      kv = records.next()

      // map为PartitionedAppendOnlyMap，将分区和key作为key，聚合值作为value

      map.changeValue((getPartition(kv._1), kv._1), update)

      // 是否需要溢写到磁盘

      maybeSpillCollection(usingMap = true)

    }

    // 如果不需要map端聚合

  } else {

    // Stick values into our buffer

    while (records.hasNext) {

      addElementsRead()

      val kv = records.next()

      // buffer为PartitionedPairBuffer，将分区和key加进去

      buffer.insert(getPartition(kv._1), kv._1, kv._2.asInstanceOf[C])

      // 是否需要溢写到磁盘

      maybeSpillCollection(usingMap = false)

    }

  }

}

该方法主要是判断在插入数据时，是否需要在map端进行预聚合，分别采用两种数据结构来保存

maybeSpillCollection()方法里面会调用maybeSpill()方法检查是否需要溢写，如果发生溢写，重新构造一个map或者buffer结构从头开始缓存，如下：

private def maybeSpillCollection(usingMap: Boolean): Unit = {

  var estimatedSize = 0L

  if (usingMap) {

    estimatedSize = map.estimateSize()

    // 判断是否需要溢写

    if (maybeSpill(map, estimatedSize)) {

      map = new PartitionedAppendOnlyMap[K, C]

    }

  } else {

    estimatedSize = buffer.estimateSize()

    // 判断是否需要溢写

    if (maybeSpill(buffer, estimatedSize)) {

      buffer = new PartitionedPairBuffer[K, C]

    }

  }

  if (estimatedSize > _peakMemoryUsedBytes) {

    _peakMemoryUsedBytes = estimatedSize

  }

}

  protected def maybeSpill(collection: C, currentMemory: Long): Boolean = {

    var shouldSpill = false

    // 如果读取的记录数是32的倍数，并且预估map或者buffer内存占用大于默认的5m阈值

    if (elementsRead % 32 == 0 && currentMemory >= myMemoryThreshold) {

      // Claim up to double our current memory from the shuffle memory pool

      // 尝试申请2*currentMemory-5m的内存

      val amountToRequest = 2 * currentMemory - myMemoryThreshold

      val granted = acquireMemory(amountToRequest)

      // 更新阈值

      myMemoryThreshold += granted

      // If we were granted too little memory to grow further (either tryToAcquire returned 0,

      // or we already had more memory than myMemoryThreshold), spill the current collection

      // 判断，如果还是不够，确定溢写

      shouldSpill = currentMemory >= myMemoryThreshold

    }

    // 如果shouldSpill为false，但是读取的记录数大于Integer.MAX_VALUE，也是需要溢写

    shouldSpill = shouldSpill || _elementsRead > numElementsForceSpillThreshold

    // Actually spill

    if (shouldSpill) {

      // 溢写次数+1

      _spillCount += 1

      logSpillage(currentMemory)

      // 溢写缓存的集合

      spill(collection)

      _elementsRead = 0

      _memoryBytesSpilled += currentMemory

      // 释放内存

      releaseMemory()

    }

    shouldSpill

  }

maybeSpill()方法里面会调用spill()进行溢写，如下：

  override protected[this] def spill(collection: WritablePartitionedPairCollection[K, C]): Unit = {

    // 根据给定的比较器进行排序，返回排序结果的迭代器

    val inMemoryIterator = collection.destructiveSortedWritablePartitionedIterator(comparator)

    // 将迭代器中的数据溢写到磁盘文件中

    val spillFile = spillMemoryIteratorToDisk(inMemoryIterator)

    // ArrayBuffer记录所有溢写的文件

    spills += spillFile

  }

spillMemoryIteratorToDisk()方法如下：

private[this] def spillMemoryIteratorToDisk(inMemoryIterator: WritablePartitionedIterator)

    : SpilledFile = {

  // Because these files may be read during shuffle, their compression must be controlled by

  // spark.shuffle.compress instead of spark.shuffle.spill.compress, so we need to use

  // createTempShuffleBlock here; see SPARK-3426 for more context.

  // 创建一个临时块

  val (blockId, file) = diskBlockManager.createTempShuffleBlock()

  // These variables are reset after each flush

  var objectsWritten: Long = 0

  val spillMetrics: ShuffleWriteMetrics = new ShuffleWriteMetrics

  // 创建溢写文件的DiskBlockObjectWriter

  val writer: DiskBlockObjectWriter =

    blockManager.getDiskWriter(blockId, file, serInstance, fileBufferSize, spillMetrics)

  // List of batch sizes (bytes) in the order they are written to disk

  // 记录写入批次大小

  val batchSizes = new ArrayBuffer[Long]

  // How many elements we have in each partition

  // 记录每个分区条数

  val elementsPerPartition = new Array[Long](numPartitions)

  // Flush the disk writer's contents to disk, and update relevant variables.

  // The writer is committed at the end of this process.

  // 将内存中的数据按批次刷写到磁盘中

  def flush(): Unit = {

    val segment = writer.commitAndGet()

    batchSizes += segment.length

    _diskBytesSpilled += segment.length

    objectsWritten = 0

  }

  var success = false

  try {

    // 遍历map或者buffer中的记录

    while (inMemoryIterator.hasNext) {

      val partitionId = inMemoryIterator.nextPartition()

      require(partitionId >= 0 && partitionId < numPartitions,

        s"partition Id: ${partitionId} should be in the range [0, ${numPartitions})")

      // 写入并更新计数值

      inMemoryIterator.writeNext(writer)

      elementsPerPartition(partitionId) += 1

      objectsWritten += 1

      // 写入条数达到10000条时，将这批刷写到磁盘

      if (objectsWritten == serializerBatchSize) {

        flush()

      }

    }

    // 遍历完以后，将剩余的刷写到磁盘

    if (objectsWritten > 0) {

      flush()

    } else {

      writer.revertPartialWritesAndClose()

    }

    success = true

  } finally {

    if (success) {

      writer.close()

    } else {

      // This code path only happens if an exception was thrown above before we set success;

      // close our stuff and let the exception be thrown further

      writer.revertPartialWritesAndClose()

      if (file.exists()) {

        if (!file.delete()) {

          logWarning(s"Error deleting ${file}")

        }

      }

    }

  }

  // 返回溢写文件

  SpilledFile(file, blockId, batchSizes.toArray, elementsPerPartition)

}

接下来就是排序合并操作，调用ExternalSorter.writePartitionedMapOutput()方法：

def writePartitionedMapOutput(

    shuffleId: Int,

    mapId: Long,

    mapOutputWriter: ShuffleMapOutputWriter): Unit = {

  var nextPartitionId = 0

  // 如果没有发生溢写

  if (spills.isEmpty) {

    // Case where we only have in-memory data

    val collection = if (aggregator.isDefined) map else buffer

    // 根据指定的比较器进行排序

    val it = collection.destructiveSortedWritablePartitionedIterator(comparator)

    while (it.hasNext()) {

      val partitionId = it.nextPartition()

      var partitionWriter: ShufflePartitionWriter = null

      var partitionPairsWriter: ShufflePartitionPairsWriter = null

      TryUtils.tryWithSafeFinally {

        partitionWriter = mapOutputWriter.getPartitionWriter(partitionId)

        val blockId = ShuffleBlockId(shuffleId, mapId, partitionId)

        partitionPairsWriter = new ShufflePartitionPairsWriter(

          partitionWriter,

          serializerManager,

          serInstance,

          blockId,

          context.taskMetrics().shuffleWriteMetrics)

        // 将分区内的数据依次取出

        while (it.hasNext && it.nextPartition() == partitionId) {

          it.writeNext(partitionPairsWriter)

        }

      } {

        if (partitionPairsWriter != null) {

          partitionPairsWriter.close()

        }

      }

      nextPartitionId = partitionId + 1

    }

    // 如果发生溢写，将溢写文件和缓存数据进行归并排序，排序完成后按照分区依次写入ShufflePartitionPairsWriter

  } else {

    // We must perform merge-sort; get an iterator by partition and write everything directly.

    // 这里会进行归并排序

    for ((id, elements) <- this.partitionedIterator) {

      val blockId = ShuffleBlockId(shuffleId, mapId, id)

      var partitionWriter: ShufflePartitionWriter = null

      var partitionPairsWriter: ShufflePartitionPairsWriter = null

      TryUtils.tryWithSafeFinally {

        partitionWriter = mapOutputWriter.getPartitionWriter(id)

        partitionPairsWriter = new ShufflePartitionPairsWriter(

          partitionWriter,

          serializerManager,

          serInstance,

          blockId,

          context.taskMetrics().shuffleWriteMetrics)

        if (elements.hasNext) {

          for (elem <- elements) {

            partitionPairsWriter.write(elem._1, elem._2)

          }

        }

      } {

        if (partitionPairsWriter != null) {

          partitionPairsWriter.close()

        }

      }

      nextPartitionId = id + 1

    }

  }

  context.taskMetrics().incMemoryBytesSpilled(memoryBytesSpilled)

  context.taskMetrics().incDiskBytesSpilled(diskBytesSpilled)

  context.taskMetrics().incPeakExecutionMemory(peakMemoryUsedBytes)

}

partitionedIterator()方法：

def partitionedIterator: Iterator[(Int, Iterator[Product2[K, C]])] = {

  val usingMap = aggregator.isDefined

  val collection: WritablePartitionedPairCollection[K, C] = if (usingMap) map else buffer

  if (spills.isEmpty) {

    // Special case: if we have only in-memory data, we don't need to merge streams, and perhaps

    // we don't even need to sort by anything other than partition ID

    // 如果没有溢写，并且没有排序，只按照分区id排序

    if (ordering.isEmpty) {

      // The user hasn't requested sorted keys, so only sort by partition ID, not key

      groupByPartition(destructiveIterator(collection.partitionedDestructiveSortedIterator(None)))

      // 如果没有溢写但是排序，先按照分区id排序，再按key排序

    } else {

      // We do need to sort by both partition ID and key

      groupByPartition(destructiveIterator(

        collection.partitionedDestructiveSortedIterator(Some(keyComparator))))

    }

  } else {

    // Merge spilled and in-memory data

    // 如果有溢写，就将溢写文件和内存中的数据归并排序

    merge(spills, destructiveIterator(

      collection.partitionedDestructiveSortedIterator(comparator)))

  }

}

归并方法如下：

private def merge(spills: Seq[SpilledFile], inMemory: Iterator[((Int, K), C)])

    : Iterator[(Int, Iterator[Product2[K, C]])] = {

  // 读取溢写文件

  val readers = spills.map(new SpillReader(_))

  val inMemBuffered = inMemory.buffered

  // 遍历分区

  (0 until numPartitions).iterator.map { p =>

    val inMemIterator = new IteratorForPartition(p, inMemBuffered)

    // 合并溢写文件和内存中的数据

    val iterators = readers.map(_.readNextPartition()) ++ Seq(inMemIterator)

    // 如果有聚合逻辑，按分区聚合，对key按照keyComparator排序

    if (aggregator.isDefined) {

      // Perform partial aggregation across partitions

      (p, mergeWithAggregation(

        iterators, aggregator.get.mergeCombiners, keyComparator, ordering.isDefined))

      // 如果没有聚合，但是有排序逻辑，按照ordering做归并

    } else if (ordering.isDefined) {

      // No aggregator given, but we have an ordering (e.g. used by reduce tasks in sortByKey);

      // sort the elements without trying to merge them

      (p, mergeSort(iterators, ordering.get))

      // 什么都没有直接归并

    } else {

      (p, iterators.iterator.flatten)

    }

  }

}

在write()方法中调用commitAllPartitions()方法输出数据，其中调用writeIndexFileAndCommit()方法写出数据和索引文件，如下：

def writeIndexFileAndCommit(

    shuffleId: Int,

    mapId: Long,

    lengths: Array[Long],

    dataTmp: File): Unit = {

  // 创建索引文件和临时索引文件

  val indexFile = getIndexFile(shuffleId, mapId)

  val indexTmp = Utils.tempFileWith(indexFile)

  try {

    // 获取shuffle data file

    val dataFile = getDataFile(shuffleId, mapId)

    // There is only one IndexShuffleBlockResolver per executor, this synchronization make sure

    // the following check and rename are atomic.

    // 对于每个executor只有一个IndexShuffleBlockResolver，确保原子性

    synchronized {

      // 检查索引是否和数据文件已经有了对应关系

      val existingLengths = checkIndexAndDataFile(indexFile, dataFile, lengths.length)

      if (existingLengths != null) {

        // Another attempt for the same task has already written our map outputs successfully,

        // so just use the existing partition lengths and delete our temporary map outputs.

        // 如果存在对应关系，说明shuffle write已经完成，删除临时索引文件

        System.arraycopy(existingLengths, 0, lengths, 0, lengths.length)

        if (dataTmp != null && dataTmp.exists()) {

          dataTmp.delete()

        }

      } else {

        // 如果不存在，创建一个BufferedOutputStream

        // This is the first successful attempt in writing the map outputs for this task,

        // so override any existing index and data files with the ones we wrote.

        val out = new DataOutputStream(new BufferedOutputStream(new FileOutputStream(indexTmp)))

        Utils.tryWithSafeFinally {

          // We take in lengths of each block, need to convert it to offsets.

          // 获取每个分区的大小，累加偏移量，写入临时索引文件

          var offset = 0L

          out.writeLong(offset)

          for (length <- lengths) {

            offset += length

            out.writeLong(offset)

          }

        } {

          out.close()

        }

        // 删除可能存在的其他索引文件

        if (indexFile.exists()) {

          indexFile.delete()

        }

        // 删除可能存在的其他数据文件

        if (dataFile.exists()) {

          dataFile.delete()

        }

        // 将临时文件重命名成正式文件

        if (!indexTmp.renameTo(indexFile)) {

          throw new IOException("fail to rename file " + indexTmp + " to " + indexFile)

        }

        if (dataTmp != null && dataTmp.exists() && !dataTmp.renameTo(dataFile)) {

          throw new IOException("fail to rename file " + dataTmp + " to " + dataFile)

        }

      }

    }

  } finally {

    if (indexTmp.exists() && !indexTmp.delete()) {

      logError(s"Failed to delete temporary index file at ${indexTmp.getAbsolutePath}")

    }

  }

}

4.小结

Spark在初始化SparkEnv的时候，会在create()方法里面初始化ShuffleManager，包含sort和tungsten-sort两种shuffle
ShuffleManager是一个特质，核心方法有registerShuffle()、getReader()、getWriter()，
SortShuffleManager是ShuffleManager的唯一实现类，在registerShuffle()方法里面选择采用哪种shuffle机制，getReader()方法只会返回一种BlockStoreShuffleReader，getWriter()方法根据不同的handle选择不同的Writer，共有三种
BypassMergeSortShuffleWriter：如果当前shuffle依赖中没有map端的聚合操作，并且分区数小于spark.shuffle.sort.bypassMergeThreshold的值，默认为200，启用bypass机制，核心方法有：write()、writePartitionedData()（合并所有分区文件，默认采用零拷贝方式）
UnsafeShuffleWriter：如果serializer支持relocation并且map端没有聚合同时分区数目不大于16777215+1三个条件都满足，采用该Writer，核心方法有：write()、insertRecordIntoSorter()（将数据插入外部选择器排序）、closeAndWriteOutput()（合并并输出文件），前一个方法里核心方法有：insertRecord()（将序列化数据插入外部排序器）、growPointerArrayIfNecessary()（如果需要额外空间需要对数组扩容或溢写到磁盘）、spill()（溢写到磁盘）、writeSortedFile()（将内存中的数据进行排序并写出到磁盘文件中）、encodePageNumberAndOffset()（对当前数据的逻辑地址进行编码，转成long型），后面的方法里核心方法有：mergeSpills()（合并溢写文件），合并文件的时候有BIO和NIO两种方式
SortShuffleWriter：如果上面两者都不满足，采用该Writer，该Writer会使用PartitionedAppendOnlyMap或PartitionedPariBuffer在内存中进行排序，如果超过内存限制，会溢写到文件中，在全局输出有序文件的时候，对之前的所有输出文件和当前内存中的数据进行全局归并排序，对key相同的元素会使用定义的function进行聚合核心方法有：write()、insertAll()（将数据放入ExternalSorter进行排序）、maybeSpillCollection()（是否需要溢写到磁盘）、maybeSpill()、spill()、spillMemoryIteratorToDisk()（将内存中数据溢写到磁盘）、writePartitionedMapOutput()、commitAllPartitions()里面调用writeIndexFileAndCommit()方法写出数据和索引文件

秒客网

Spark Shuffle机制详细源码解析