DStream-05: How updateStateByKey Works, with Source Code

Demo

updateStateByKey accumulates the word count results of each batch into a running total.

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val conf = new SparkConf().setMaster("local[2]").setAppName("NetworkWordCount")
    val ssc = new StreamingContext(conf, Seconds(1))
    ssc.sparkContext.setLogLevel("WARN")
    // updateStateByKey requires a checkpoint directory
    ssc.checkpoint("/Users/chouc/Work/IdeaProjects/learning/learning/spark/src/main/resources/checkpoint/SocketDstream")
    val lines = ssc.socketTextStream("localhost", 9999)
    val wordCounts = lines.flatMap(_.split(" ")).map((_, 1))
      .updateStateByKey[Int]((seq: Seq[Int], total: Option[Int]) => {
        // seq: counts of this word in the current batch; total: the running count so far
        total match {
          case Some(value) => Option(seq.sum + value)
          case None => Option(seq.sum)
        }
      })
    wordCounts.print()
    ssc.start()
    ssc.awaitTermination()

Source code

Accumulating is actually fairly simple: add the result of the current batch to the result of the previous batch.
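To make "current batch + previous result" concrete, here is a minimal sketch in plain Scala collections, with no Spark involved. The object name RunningCountSketch and the helper step are hypothetical and only for illustration; updateFunc has the same shape as the function passed to updateStateByKey in the demo.

```scala
object RunningCountSketch {
  // Same shape as the demo's update function:
  // (values of one key in this batch, previous state) => new state.
  def updateFunc(newValues: Seq[Int], previous: Option[Int]): Option[Int] =
    Some(newValues.sum + previous.getOrElse(0))

  // Hypothetical helper: applies updateFunc to one batch of (word, 1) pairs,
  // visiting every key that appears in the old state or in the new batch.
  def step(state: Map[String, Int], batch: Seq[(String, Int)]): Map[String, Int] = {
    val grouped: Map[String, Seq[Int]] =
      batch.groupBy(_._1).map { case (k, kvs) => (k, kvs.map(_._2)) }
    val keys = state.keySet ++ grouped.keySet
    keys.flatMap { k =>
      updateFunc(grouped.getOrElse(k, Seq.empty), state.get(k)).map(s => (k, s))
    }.toMap
  }

  def main(args: Array[String]): Unit = {
    val batch1 = "a b a".split(" ").toSeq.map((_, 1))
    val batch2 = "b c".split(" ").toSeq.map((_, 1))
    val afterBatch1 = step(Map.empty, batch1) // a = 2, b = 1
    val afterBatch2 = step(afterBatch1, batch2) // a = 2, b = 2, c = 1
    println(afterBatch1)
    println(afterBatch2)
  }
}
```

Note that the key "a" keeps its count in the second batch even though it received no new data: the update function is called for it with an empty sequence.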

The entry point is updateStateByKey.

PairDStreamFunctions

  def updateStateByKey[S: ClassTag](
      updateFunc: (Iterator[(K, Seq[V], Option[S])]) => Iterator[(K, S)],
      partitioner: Partitioner,
      rememberPartitioner: Boolean): DStream[(K, S)] = ssc.withScope {
    val cleanedFunc = ssc.sc.clean(updateFunc)
    val newUpdateFunc = (_: Time, it: Iterator[(K, Seq[V], Option[S])]) => {
      cleanedFunc(it)
    }
    new StateDStream(self, newUpdateFunc, partitioner, rememberPartitioner, None)
  }
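The demo passes a per-key function of type (Seq[V], Option[S]) => Option[S], while the overload above takes a per-partition iterator function. As far as I recall, the simpler overloads in PairDStreamFunctions wrap the former into the latter with a flatMap; the sketch below reproduces that idea in plain Scala. The names UpdateFuncAdapterSketch and lift are hypothetical, and the exact wrapper code may differ between Spark versions.

```scala
object UpdateFuncAdapterSketch {
  // Hypothetical helper: lifts a per-key update function to the iterator form
  // expected by the overload shown above. Keys for which f returns None are
  // dropped, which is how a key's state gets removed.
  def lift[K, V, S](
      f: (Seq[V], Option[S]) => Option[S]
  ): Iterator[(K, Seq[V], Option[S])] => Iterator[(K, S)] =
    it => it.flatMap { case (k, vs, prev) => f(vs, prev).map(s => (k, s)) }

  def main(args: Array[String]): Unit = {
    val sumCounts: (Seq[Int], Option[Int]) => Option[Int] =
      (vs, prev) => Some(vs.sum + prev.getOrElse(0))
    // Variant that forgets keys which received no new values in this batch.
    val dropIdle: (Seq[Int], Option[Int]) => Option[Int] =
      (vs, prev) => if (vs.isEmpty) None else Some(vs.sum + prev.getOrElse(0))

    val input: List[(String, Seq[Int], Option[Int])] =
      List(("a", Seq(1, 1), Some(3)), ("b", Seq.empty[Int], Some(5)))

    println(lift[String, Int, Int](sumCounts)(input.iterator).toList) // List((a,5), (b,5))
    println(lift[String, Int, Int](dropIdle)(input.iterator).toList)  // List((a,5))
  }
}
```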

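As mentioned in the DStream-04 article on window functions, after each batch every DStream keeps the RDD it generated in memory so that the next batch can use it, which makes accumulation simpler. How do we get the previous RDD? The current batch time minus slideDuration is the timestamp of the previous batch, and getOrCompute returns the RDD for that time.

By default, slideDuration equals the batch interval (batchInterval). For window operations it is also the batch interval unless a slide interval is specified.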
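The lookup of the previous batch can be sketched without Spark as follows. Time, Duration and CachedStream here are hypothetical stand-ins for the Spark classes of similar names, reduced to milliseconds and a per-time cache; the real DStream.getOrCompute does more (time validation against the slide interval, persistence, checkpointing).

```scala
import scala.collection.mutable

object PreviousBatchSketch {
  // Hypothetical stand-ins for Spark's Duration and Time, both in milliseconds.
  final case class Duration(millis: Long)
  final case class Time(millis: Long) {
    def -(d: Duration): Time = Time(millis - d.millis)
  }

  // Hypothetical stand-in for a DStream: caches one result per batch time.
  class CachedStream[T](val slideDuration: Duration, zeroTime: Time)(computeBatch: Time => T) {
    private val generated = mutable.HashMap.empty[Time, T]

    // Returns the cached result for `time`, computing it on first access.
    // Times at or before the start of the stream have no result.
    def getOrCompute(time: Time): Option[T] =
      if (time.millis <= zeroTime.millis) None
      else Some(generated.getOrElseUpdate(time, computeBatch(time)))

    // Previous batch = current batch time minus the slide duration.
    def previous(validTime: Time): Option[T] =
      getOrCompute(validTime - slideDuration)
  }

  def main(args: Array[String]): Unit = {
    val stream = new CachedStream[String](Duration(1000), Time(0))(t => s"state@${t.millis}")
    println(stream.previous(Time(3000))) // Some(state@2000)
    println(stream.previous(Time(1000))) // None: this is the first batch
  }
}
```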

StateDStream

class StateDStream[K: ClassTag, V: ClassTag, S: ClassTag](
    parent: DStream[(K, V)],
    updateFunc: (Time, Iterator[(K, Seq[V], Option[S])]) => Iterator[(K, S)],
    partitioner: Partitioner,
    preservePartitioning: Boolean,
    initialRDD: Option[RDD[(K, S)]]
  ) extends DStream[(K, S)](parent.ssc) {

  // Note: a StateDStream must be checkpointed, so a checkpoint directory has to be set to save its data.
  super.persist(StorageLevel.MEMORY_ONLY_SER)

  override val mustCheckpoint = true

  // This method merges the state RDD of the previous batch with the data of the current batch
  private [this] def computeUsingPreviousRDD(
      batchTime: Time,
      parentRDD: RDD[(K, V)],
      prevStateRDD: RDD[(K, S)]) = {
    // Define the function for the mapPartition operation on cogrouped RDD;
    // first map the cogrouped tuple to tuples of required type,
    // and then apply the update function
    val updateFuncLocal = updateFunc
    val finalFunc = (iterator: Iterator[(K, (Iterable[V], Iterable[S]))]) => {
      val i = iterator.map { t =>
        val itr = t._2._2.iterator
        val headOption = if (itr.hasNext) Some(itr.next()) else None
        (t._1, t._2._1.toSeq, headOption)
      }
      updateFuncLocal(batchTime, i)
    }
    // cogroup the current batch with the previous state by key
    val cogroupedRDD = parentRDD.cogroup(prevStateRDD, partitioner)
    // then apply the update function to the cogrouped result
    val stateRDD = cogroupedRDD.mapPartitions(finalFunc, preservePartitioning)
    Some(stateRDD)
  }

  override def compute(validTime: Time): Option[RDD[(K, S)]] = {

    // Try to get the previous state RDD
    // Compute the previous batch time and fetch the state RDD of that batch.
    getOrCompute(validTime - slideDuration) match {

      // If it exists, there is earlier state; if not, this is the first batch
      case Some(prevStateRDD) =>    // If previous state RDD exists
        // Try to get the parent RDD
        // Get the data that arrived in the current batch. This is slightly confusing:
        // parent.getOrCompute(validTime) is the result computed by the upstream DStream
        // for this batch; the compute method of MappedDStream makes this clearer.
        parent.getOrCompute(validTime) match {
          case Some(parentRDD) =>    // If parent RDD exists, then compute as usual
            // Merge the data of the two RDDs.
            computeUsingPreviousRDD (validTime, parentRDD, prevStateRDD)
          case None =>     // If parent RDD does not exist

            // Re-apply the update function to the old state RDD
            val updateFuncLocal = updateFunc
            val finalFunc = (iterator: Iterator[(K, S)]) => {
              val i = iterator.map(t => (t._1, Seq.empty[V], Option(t._2)))
              updateFuncLocal(validTime, i)
            }
            val stateRDD = prevStateRDD.mapPartitions(finalFunc, preservePartitioning)
            Some(stateRDD)
        }

      case None =>    // If previous session RDD does not exist (first input data)
        // Try to get the parent RDD
        parent.getOrCompute(validTime) match {
          case Some(parentRDD) =>   // If parent RDD exists, then compute as usual
            initialRDD match {
              case None =>
                // Define the function for the mapPartition operation on grouped RDD;
                // first map the grouped tuple to tuples of required type,
                // and then apply the update function
                val updateFuncLocal = updateFunc
                val finalFunc = (iterator: Iterator[(K, Iterable[V])]) => {
                  updateFuncLocal (validTime,
                    iterator.map (tuple => (tuple._1, tuple._2.toSeq, None)))
                }

                val groupedRDD = parentRDD.groupByKey(partitioner)
                val sessionRDD = groupedRDD.mapPartitions(finalFunc, preservePartitioning)
                // logDebug("Generating state RDD for time " + validTime + " (first)")
                Some (sessionRDD)
              case Some (initialStateRDD) =>
                computeUsingPreviousRDD(validTime, parentRDD, initialStateRDD)
            }
          case None => // If parent RDD does not exist, then nothing to do!
            // logDebug("Not generating state RDD (no previous state, no parent)")
            None
        }
    }
  }
}
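What computeUsingPreviousRDD does can be imitated with ordinary collections: cogroup the current batch with the previous state by key, take at most one previous state per key, and hand the triples to the update function. In this sketch the object CogroupStateSketch and its cogroup helper are hypothetical in-memory analogues, not Spark APIs, and partitioning is ignored entirely.

```scala
object CogroupStateSketch {
  // Hypothetical in-memory analogue of RDD.cogroup for two keyed sequences:
  // for every key on either side, collect its values from both sides.
  def cogroup[K, V, S](cur: Seq[(K, V)], prev: Seq[(K, S)]): Map[K, (Seq[V], Seq[S])] = {
    val keys = (cur.map(_._1) ++ prev.map(_._1)).distinct
    keys.map { k =>
      (k, (cur.filter(_._1 == k).map(_._2), prev.filter(_._1 == k).map(_._2)))
    }.toMap
  }

  // Mirrors the shape of computeUsingPreviousRDD: cogroup, turn each entry
  // into (key, new values, optional previous state), then apply updateFunc.
  def computeUsingPrevious[K, V, S](
      cur: Seq[(K, V)],
      prev: Seq[(K, S)],
      updateFunc: Iterator[(K, Seq[V], Option[S])] => Iterator[(K, S)]): Map[K, S] = {
    val triples = cogroup(cur, prev).iterator.map { case (k, (vs, ss)) =>
      (k, vs, ss.headOption) // the state side holds at most one value per key
    }
    updateFunc(triples).toMap
  }

  def main(args: Array[String]): Unit = {
    val sumFunc: Iterator[(String, Seq[Int], Option[Int])] => Iterator[(String, Int)] =
      it => it.map { case (k, vs, prev) => (k, vs.sum + prev.getOrElse(0)) }

    val currentBatch = Seq(("a", 1), ("a", 1), ("c", 1))
    val previousState = Seq(("a", 2), ("b", 1))
    // a: 2 + 2 = 4, b: unchanged at 1, c: new key with 1
    println(computeUsingPrevious(currentBatch, previousState, sumFunc))
  }
}
```

Because every key of the previous state takes part in the cogroup, the update function runs for all existing keys on every batch, including keys with no new data.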
