This section implements a real WordCount program in Spark.
1. Get a dataset from the local file system
val speechRdd = sc.parallelize(scala.io.Source.fromFile("/home/hdfs/Data/WordCount/speech").getLines.toList)
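For anything bigger than a small sample file, sc.textFile is the more common way to load text, because it avoids reading the whole file on the driver first. A minimal sketch, assuming the same /home/hdfs/Data/WordCount/speech path is readable where the job runs (local mode, or a path every executor can see):

val speechRdd = sc.textFile("/home/hdfs/Data/WordCount/speech")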
2. Split each line into individual words
val wordMap = speechRdd.flatMap(line => line.split(" "))
3. Strip punctuation and remove case differences
val wordCount = wordMap.map(word => {
  val w = word.replaceAll("[,.?!:;]", " ").toLowerCase.trim
  (w, 1)
})
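Because punctuation is replaced with a space and split(" ") keeps blank tokens from consecutive spaces, some keys end up empty (the "" -> 27 entry in the countByKey output below). A minimal sketch of one way to drop them, using a hypothetical cleanedWords name, is an extra filter before reducing:

// hypothetical extra step: keep only non-empty keys
val cleanedWords = wordCount.filter { case (w, _) => w.nonEmpty }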
4. Write the reduce step
val wordReduce = wordCount.reduceByKey((sum, current) => sum + current)
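The same reduction is commonly written with Scala's placeholder syntax, which is equivalent:

val wordReduce = wordCount.reduceByKey(_ + _)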
Instead of writing the reduceByKey step, you can also call countByKey directly on the wordCount RDD:
scala> wordCount.countByKey
res2: scala.collection.Map[String,Long] = Map(krishna -> 1, beneath -> 1, opinions -> 1, beautiful -> 2, sunday -> 1, devastating -> 1, drown -> 1, cells -> 2, down -> 3, savings -> 1, heaviness -> 1, application -> 1, interesting -> 1, 7 30 -> 1, "" -> 27, desktop -> 1, read -> 1, papers -> 1, failure -> 2, mother -> 3, for -> 17, biopsy -> 1, find -> 4, school -> 1, directors -> 1, coke -> 1, people -> 1, begin -> 2, any -> 2, website -> 1, ?.tay -> 1, mac -> 3, decisions -> 1, across -> 1, gradually -> 1, years -> 9, i?. -> 3, young -> 2, talented -> 1, doctor?. -> 1, this -> 11, death -> 6, curable -> 1, in -> 34, subtle -> 1, remarkable -> 1, myself -> 2, have -> 17, learned -> 1, needed -> 1, your -> 16, ?.f -> 2, off -> 1, ?.f -> 1, fonts -> 1, offered -> 1, bottles -> 1, are -> ...
scala>
The result is a plain Scala Map collected to the driver, no longer an RDD, so reduceByKey is the better choice when the number of distinct keys is large.
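Since the result is an ordinary Map, individual entries can be looked up directly on the driver, for example (reusing the wordCount RDD from step 3):

val counts = wordCount.countByKey   // scala.collection.Map[String, Long]
counts.getOrElse("death", 0L)       // 6 in the sample output above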
wordReduce.take(20).foreach(println)
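take(20) returns an arbitrary 20 pairs; to print the most frequent words instead, the reduced RDD can be sorted by count first. A minimal sketch using the wordReduce RDD built above:

wordReduce.sortBy(_._2, ascending = false).take(20).foreach(println)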