006 Spark中的wordcount以及TopK的程序编写

1.启动

　　启动HDFS

　　启动spark的local模式./spark-shell

2.知识点

　textFile:

  def textFile(
    path: String,
    minPartitions: Int = defaultMinPartitions): RDD[String]

　Filter:　

　　Return a new RDD containing only the elements that satisfy a predicate.

　　def filter(f: T => Boolean): RDD[T],返回里面判断是true的RDD。

　map:

  Return a new RDD by applying a function to all elements of this RDD.

　def map[U: ClassTag](f: T => U): RDD[U],从T到U类型的一个数据转换函数，最终返回的RDD中的数据类型是f函数返回的数据类型

　flatMap：

    Return a new RDD by first applying a function to all elements of this
RDD, and then flattening the results.

    def flatMap[U: ClassTag](f: T => TraversableOnce[U]): RDD[U]
　　从T到集合类型的数据类型转换，集合中的数据类型是U，最终返回的RDD数据类型是f函数返回的集合中的具体的类型数据。

3.编写基础的wordcount程序

 //读取文件

 val rdd=sc.textFile("wc/input/wc.input")

 //过滤数据

 val filterRdd=rdd.filter(len=>len.length>0)

 //数据转换

 val flatMapRdd=filterRdd.flatMap(line=>line.split(" ")

     .map(word=>(word,1)))

 //分组

 val groupByRdd=flatMapRdd.groupBy(tuple=>tuple._1)

 //聚合

 val wordCount=groupByRdd.map(tuple=>{

     val word=tuple._1

     val sum=tuple._2.toList.foldLeft(0)((a,b)=>a+b._2)

     (word,sum)

 })

 //输出

 wordCount.foreach(println)             //控制台上的输出

 wordCount.saveAsTextFile("wc/output6") //HDFS上的输出

4.简化代码（链式编程）

 sc.textFile("wc/input/wc.input").

 //数据过滤

 filter(_.length>0).

 //数据转换

 flatMap(_.split(" ").map((_,1))).

 //分组

 groupByKey().

 //统计

 map(tuple=>(tuple._1,tuple._2.toList.sum)).

 //输出

 saveAsTextFile("wc/output7")

5.最优化程序

　　reduceByKey存在combiner。

　　groupBy在大数据量的情况下，会出现OOM

 sc.textFile("wc/input/wc.input").

 //数据过滤

 filter(_.length>0).

 //数据转换

 flatMap(_.split(" ").map((_,1))).

 //统计

 reduceByKey(_+_).

 //输出

 saveAsTextFile("wc/output8")

6.显示结果

 sc.textFile("wc/input/wc.input").

 //数据过滤

 filter(_.length>).

 //数据转换

 flatMap(_.split(" ").map((_,))).

 //统计

 reduceByKey(_+_).

 collect()

7.排序（第二个数，从大到小）

 sc.textFile("wc/input/wc.input").

 //数据过滤

 filter(_.length>).

 //数据转换

 flatMap(_.split(" ").map((_,))).

 //统计

 reduceByKey(_+_).

 //排序

 sortBy(tuple=>tuple._2,ascending=false).

 collect()

8.TopK(方式一)

 sc.textFile("wc/input/wc.input").

 //数据过滤

 filter(_.length>).

 //数据转换

 flatMap(_.split(" ").map((_,))).

 //统计

 reduceByKey(_+_).

 //排序

 sortBy(tuple=>tuple._2,ascending=false).

 take()

9.TopK（方式二，自定义）

 sc.textFile("wc/input/wc.input").

 //数据过滤

 filter(_.length>).

 //数据转换

 flatMap(_.split(" ").map((_,))).

 //统计

 reduceByKey(_+_).

 //排序

 sortBy(tuple=>tuple._2,ascending=false).

 top()(new scala.math.Ordering[(String,Int)](){

     override def compare(x:(String,Int),y:(String,Int))={

         val tmp=x._2.compare(y._2)

         if(tmp!=) tmp

         else x._1.compare(x._1)

     }

     })

秒客网

006 Spark中的wordcount以及TopK的程序编写

相关文章