[Original] Big Data Fundamentals of Spark (6): Implementation of Spark RDD Sort

Posted: 2021-08-03 03:25:21

spark 2.1.1

In Spark, distributed data can be sorted with RDD.sortBy. How is this implemented under the hood? Let's look at the code:

org.apache.spark.rdd.RDD

  /**
   * Return this RDD sorted by the given key function.
   */
  def sortBy[K](
      f: (T) => K,
      ascending: Boolean = true,
      numPartitions: Int = this.partitions.length)
      (implicit ord: Ordering[K], ctag: ClassTag[K]): RDD[T] = withScope {
    this.keyBy[K](f)
        .sortByKey(ascending, numPartitions)
        .values
  }

  /**
   * Creates tuples of the elements in this RDD by applying `f`.
   */
  def keyBy[K](f: T => K): RDD[(K, T)] = withScope {
    val cleanedF = sc.clean(f)
    map(x => (cleanedF(x), x))
  }

  /**
   * Sort the RDD by key, so that each partition contains a sorted range of the elements. Calling
   * `collect` or `save` on the resulting RDD will return or output an ordered list of records
   * (in the `save` case, they will be written to multiple `part-X` files in the filesystem, in
   * order of the keys).
   */
  // TODO: this currently doesn't work on P other than Tuple2!
  def sortByKey(ascending: Boolean = true, numPartitions: Int = self.partitions.length)
      : RDD[(K, V)] = self.withScope
  {
    val part = new RangePartitioner(numPartitions, self, ascending)
    new ShuffledRDD[K, V, V](self, part)
      .setKeyOrdering(if (ascending) ordering else ordering.reverse)
  }

The code is straightforward: sort is a transformation. A keyBy function defines what to sort by; each element is then mapped as item -> (keyBy(item), item); a Partitioner is built to encode the partitioning strategy (number of partitions, ascending or descending, and so on); and finally a ShuffledRDD is returned.
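As a quick sanity check of this decomposition, here is a minimal self-contained sketch against the public RDD API (the object name SortByDemo, the local[2] master and the sample data are illustrative choices, not from the original post): sortBy produces the same ordering as the explicit keyBy + sortByKey + values chain.

import org.apache.spark.{SparkConf, SparkContext}

object SortByDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("sortBy-demo").setMaster("local[2]"))
    val rdd = sc.parallelize(Seq(("b", 2), ("a", 3), ("c", 1)), numSlices = 2)

    // sortBy is sugar for keyBy + sortByKey + values
    val viaSortBy    = rdd.sortBy(_._2)
    val viaSortByKey = rdd.keyBy(_._2).sortByKey(ascending = true, numPartitions = 2).values

    println(viaSortBy.collect().mkString(", "))    // (c,1), (b,2), (a,3)
    println(viaSortByKey.collect().mkString(", ")) // same ordering
    sc.stop()
  }
}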

The internals of ShuffledRDD are covered in https://www.cnblogs.com/barneywill/p/10158457.html

The part worth a closer look here is RangePartitioner:

org.apache.spark.RangePartitioner

/**
 * A [[org.apache.spark.Partitioner]] that partitions sortable records by range into roughly
 * equal ranges. The ranges are determined by sampling the content of the RDD passed in.
 *
 * @note The actual number of partitions created by the RangePartitioner might not be the same
 * as the `partitions` parameter, in the case where the number of sampled records is less than
 * the value of `partitions`.
 */
class RangePartitioner[K : Ordering : ClassTag, V](
    partitions: Int,
    rdd: RDD[_ <: Product2[K, V]],
    private var ascending: Boolean = true)
  extends Partitioner {

  // We allow partitions = 0, which happens when sorting an empty RDD under the default settings.
  require(partitions >= 0, s"Number of partitions cannot be negative but found $partitions.")

  private var ordering = implicitly[Ordering[K]]

  // An array of upper bounds for the first (partitions - 1) partitions
  private var rangeBounds: Array[K] = {
    if (partitions <= 1) {
      Array.empty
    } else {
      // This is the sample size we need to have roughly balanced output partitions, capped at 1M.
      val sampleSize = math.min(20.0 * partitions, 1e6)
      // Assume the input partitions are roughly balanced and over-sample a little bit.
      val sampleSizePerPartition = math.ceil(3.0 * sampleSize / rdd.partitions.length).toInt
      val (numItems, sketched) = RangePartitioner.sketch(rdd.map(_._1), sampleSizePerPartition)
      if (numItems == 0L) {
        Array.empty
      } else {
        // If a partition contains much more than the average number of items, we re-sample from it
        // to ensure that enough items are collected from that partition.
        val fraction = math.min(sampleSize / math.max(numItems, 1L), 1.0)
        val candidates = ArrayBuffer.empty[(K, Float)]
        val imbalancedPartitions = mutable.Set.empty[Int]
        sketched.foreach { case (idx, n, sample) =>
          if (fraction * n > sampleSizePerPartition) {
            imbalancedPartitions += idx
          } else {
            // The weight is 1 over the sampling probability.
            val weight = (n.toDouble / sample.length).toFloat
            for (key <- sample) {
              candidates += ((key, weight))
            }
          }
        }
        if (imbalancedPartitions.nonEmpty) {
          // Re-sample imbalanced partitions with the desired sampling probability.
          val imbalanced = new PartitionPruningRDD(rdd.map(_._1), imbalancedPartitions.contains)
          val seed = byteswap32(-rdd.id - 1)
          val reSampled = imbalanced.sample(withReplacement = false, fraction, seed).collect()
          val weight = (1.0 / fraction).toFloat
          candidates ++= reSampled.map(x => (x, weight))
        }
        RangePartitioner.determineBounds(candidates, partitions)
      }
    }
  }

  def numPartitions: Int = rangeBounds.length + 1

  private var binarySearch: ((Array[K], K) => Int) = CollectionsUtils.makeBinarySearch[K]

  def getPartition(key: Any): Int = {
    val k = key.asInstanceOf[K]
    var partition = 0
    if (rangeBounds.length <= 128) {
      // If we have less than 128 partitions naive search
      while (partition < rangeBounds.length && ordering.gt(k, rangeBounds(partition))) {
        partition += 1
      }
    } else {
      // Determine which binary search method to use only once.
      partition = binarySearch(rangeBounds, k)
      // binarySearch either returns the match location or -[insertion point]-1
      if (partition < 0) {
        partition = -partition-1
      }
      if (partition > rangeBounds.length) {
        partition = rangeBounds.length
      }
    }
    if (ascending) {
      partition
    } else {
      rangeBounds.length - partition
    }
  }
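To make getPartition concrete, the toy sketch below (deliberately not Spark code; GetPartitionSketch and the sample bounds are made up for illustration) reproduces the linear-scan branch: walk the upper bounds until one is at least as large as the key, and the index reached is the target partition.

object GetPartitionSketch {
  // Mirrors the "naive search" branch of getPartition above: rangeBounds holds the
  // upper bound of each of the first (numPartitions - 1) partitions.
  def getPartition(rangeBounds: Array[Int], key: Int): Int = {
    var partition = 0
    while (partition < rangeBounds.length && key > rangeBounds(partition)) {
      partition += 1
    }
    partition
  }

  def main(args: Array[String]): Unit = {
    val bounds = Array(10, 20, 30)  // 3 upper bounds -> 4 partitions
    println(Seq(5, 10, 15, 25, 99).map(getPartition(bounds, _)))  // List(0, 0, 1, 2, 3)
  }
}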

rangeBounds is derived from the requested number of partitions, and it plays much the same role as the pivots in QuickSort.

For example: suppose the cluster has 10 nodes and we need to sort 100 million records across 100 partitions. Ideally the 100 million records would be split evenly into 100 slices, each node holding 10 of them, and every slice would simply be sorted locally with no data skew.
In practice this is hard to achieve exactly: splitting evenly is really a matter of choosing boundaries, i.e. determining the minimum and maximum key of every slice, and that can only be done precisely after scanning and counting all of the data.

Spark instead samples the data to learn its distribution and arrives at an approximately even split. Concretely, it samples about sampleSize records in total, sampleSizePerPartition from each partition, and re-samples any partition that turns out to be much larger than average so that the sample stays as balanced as possible; the boundaries are then derived from this sample, yielding rangeBounds. Once rangeBounds is known, every one of the 100 million records can be assigned to its new partition.
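The following toy sketch shows the core idea behind turning weighted sample keys into (partitions - 1) upper bounds. It is not RangePartitioner.determineBounds itself (the real implementation also skips duplicate keys, among other details), and all names are illustrative: sort the candidates by key, then emit a bound every time the accumulated weight crosses another 1/partitions of the total weight.

import scala.collection.mutable.ArrayBuffer

object RangeBoundsSketch {
  def determineBounds(candidates: Seq[(Int, Float)], partitions: Int): Array[Int] = {
    val ordered = candidates.sortBy(_._1)
    val totalWeight = ordered.map(_._2.toDouble).sum
    val step = totalWeight / partitions
    val bounds = ArrayBuffer.empty[Int]
    var cumWeight = 0.0
    var target = step
    for ((key, weight) <- ordered if bounds.length < partitions - 1) {
      cumWeight += weight
      if (cumWeight >= target) {
        bounds += key      // this key becomes the upper bound of the current partition
        target += step
      }
    }
    bounds.toArray
  }

  def main(args: Array[String]): Unit = {
    // 12 equally weighted sample keys, 4 target partitions -> 3 bounds
    val sample = (1 to 12).map(k => (k, 1.0f))
    println(determineBounds(sample, 4).mkString(", "))  // 3, 6, 9
  }
}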

One remaining question: after the sort, if the data is collected to the driver, is the resulting array still sorted?

org.apache.spark.rdd.RDD

  /**
   * Return an array that contains all of the elements in this RDD.
   *
   * @note This method should only be used if the resulting array is expected to be small, as
   * all the data is loaded into the driver's memory.
   */
  def collect(): Array[T] = withScope {
    val results = sc.runJob(this, (iter: Iterator[T]) => iter.toArray)
    Array.concat(results: _*)
  }

The answer is yes: runJob returns one array per partition, in partition-index order, and Array.concat joins them in that same order, so the collected array remains fully sorted.
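A quick way to observe both properties, namely that each output partition holds a sorted non-overlapping range and that collect concatenates the partitions in index order, is the hypothetical demo below (the object name, the local[4] master and the data are illustrative):

import org.apache.spark.{SparkConf, SparkContext}

object CollectOrderDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("collect-order").setMaster("local[4]"))
    val sorted = sc.parallelize(scala.util.Random.shuffle((1 to 20).toList), 4).sortBy(x => x)

    // Each partition is itself a sorted range of the key space ...
    sorted.glom().collect().foreach(p => println(p.mkString(", ")))
    // ... and collect() concatenates partitions in index order, so the whole array is sorted.
    println(sorted.collect().toSeq == (1 to 20))  // true
    sc.stop()
  }
}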
