Spark2 Dataset statistical measures: mean, variance, stddev (standard deviation), corr (Pearson correlation coefficient), skewness, kurtosis
val df4 = spark.sql("SELECT mean(age), variance(age), stddev(age), corr(age, yearsmarried), skewness(age), kurtosis(age) FROM Affairs")
df4.show
+--------+---...
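To make the six aggregates in the query above concrete, here is a minimal plain-Scala sketch that computes them by hand, without a Spark session. It assumes the conventions Spark SQL uses to the best of my understanding: variance and stddev are the sample versions (dividing by n - 1), skewness is the population skewness, and kurtosis is the excess kurtosis (a normal distribution scores 0). The object name StatsSketch and its helpers are illustrative and do not come from the original article.

```scala
object StatsSketch {
  def mean(xs: Seq[Double]): Double = xs.sum / xs.size

  // Sum of the k-th powers of the deviations from the mean.
  private def centralSum(xs: Seq[Double], k: Int): Double = {
    val m = mean(xs)
    xs.map(x => math.pow(x - m, k)).sum
  }

  // Sample variance: divide by n - 1 (assumed to match Spark's variance).
  def variance(xs: Seq[Double]): Double = centralSum(xs, 2) / (xs.size - 1)

  // Sample standard deviation.
  def stddev(xs: Seq[Double]): Double = math.sqrt(variance(xs))

  // Pearson correlation coefficient of two equally long columns.
  def corr(xs: Seq[Double], ys: Seq[Double]): Double = {
    require(xs.size == ys.size, "columns must have the same length")
    val mx = mean(xs)
    val my = mean(ys)
    val cov = xs.zip(ys).map { case (x, y) => (x - mx) * (y - my) }.sum
    cov / math.sqrt(centralSum(xs, 2) * centralSum(ys, 2))
  }

  // Population skewness: sqrt(n) * M3 / M2^1.5, with Mk the k-th central sum.
  def skewness(xs: Seq[Double]): Double =
    math.sqrt(xs.size.toDouble) * centralSum(xs, 3) / math.pow(centralSum(xs, 2), 1.5)

  // Excess kurtosis: n * M4 / M2^2 - 3.
  def kurtosis(xs: Seq[Double]): Double = {
    val m2 = centralSum(xs, 2)
    xs.size * centralSum(xs, 4) / (m2 * m2) - 3.0
  }
}
```

For the toy column Seq(1.0, 2.0, 3.0, 4.0, 5.0), StatsSketch.mean gives 3.0 and StatsSketch.variance gives 2.5; the column is symmetric, so its skewness is 0 up to floating-point rounding.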
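Installing CDH on CentOS7, Chapter 10: Installing Spark2 in CDH
Related articles:
Installing CDH on CentOS7, Chapter 1: Installing the CentOS7 system
Installing CDH on CentOS7, Chapter 2: Installing and starting the required software on CentOS7
Installing CDH on CentOS7, Chapter 3: Problems in CDH and their solutions
Installing CDH on CentOS7, Chapter 4: Choosing a CDH version and installation method
Installing CDH on CentOS7, Chapter 5: Installing CDH and...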
Spark2 Dataset analytic functions: the ranking functions row_number, rank, dense_rank, percent_rank
select gender, age, row_number() over(partition by gender order by age) as rowNumber, rank() over(partition by gender order by age) ...
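The difference between the four ranking functions in the query above is easiest to see on tied values. The following is a minimal plain-Scala sketch (no Spark session) that reproduces their semantics for rows partitioned by gender and ordered by age. The case classes, the object name RankingSketch, and the sample rows used with it are hypothetical and do not come from the Affairs data. Ties keep their input order here, whereas Spark makes no such promise for row_number.

```scala
object RankingSketch {
  final case class Person(gender: String, age: Int)
  final case class Ranked(gender: String, age: Int, rowNumber: Int,
                          rank: Int, denseRank: Int, percentRank: Double)

  def rankWithinGender(people: Seq[Person]): Seq[Ranked] =
    people.groupBy(_.gender).toSeq.sortBy(_._1).flatMap { case (_, group) =>
      val sorted = group.sortBy(_.age)              // order by age (stable sort)
      val n = sorted.size
      val distinctAges = sorted.map(_.age).distinct // ascending, ties removed
      sorted.zipWithIndex.map { case (p, i) =>
        // rank: position of the first row with the same age, so ties share a
        // rank and the following rank is skipped.
        val rank = sorted.indexWhere(_.age == p.age) + 1
        // dense_rank: ties share a rank and no rank is skipped.
        val dense = distinctAges.indexOf(p.age) + 1
        // percent_rank: (rank - 1) / (rows in partition - 1), 0.0 for one row.
        val pct = if (n > 1) (rank - 1).toDouble / (n - 1) else 0.0
        // row_number: consecutive numbering regardless of ties.
        Ranked(p.gender, p.age, i + 1, rank, dense, pct)
      }
    }
}
```

With ages 22, 27, 27, 32 in one partition, row_number is 1, 2, 3, 4, rank is 1, 2, 2, 4, and dense_rank is 1, 2, 2, 3.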
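Spark2 Random Forests
A random forest is an ensemble of decision trees. It combines many decision trees to reduce the risk of overfitting. The spark.ml implementation supports random forests for binary and multiclass classification as well as regression, using both continuous and categorical features. Import the packages:
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.Da...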
Spark2 file handling and running jar packages
Upload the data file:
mkdir -p data/ml/
hadoop fs -mkdir -p /datafile/wangxiao/
hadoop fs -ls /
hadoop fs -put /home/wangxiao/data/ml/Affairs.txt /datafile/wangxiao/hado...
Spark2 DataSet: creating new rows with flatMap
val dfList = List(("Hadoop", "Java,SQL,Hive,HBase,MySQL"), ("Spark", "Scala,SQL,DataSet,MLlib,GraphX"))
dfList: List[(String, String)] = List((Hadoop,J...
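The snippet above is cut off before flatMap is applied, so the following is only a sketch of the likely idea, under the assumption that each comma-separated string is split into one row per item. It uses a plain Scala List so that it runs without a Spark session; Dataset.flatMap accepts the same kind of function, given an implicit encoder for the result type. The names FlatMapSketch and exploded are illustrative.

```scala
object FlatMapSketch {
  // Same input as in the article snippet.
  val dfList: List[(String, String)] = List(
    ("Hadoop", "Java,SQL,Hive,HBase,MySQL"),
    ("Spark", "Scala,SQL,DataSet,MLlib,GraphX"))

  // Each input row yields several output rows: one (name, item) pair per
  // comma-separated item. flatMap concatenates the per-row lists.
  val exploded: List[(String, String)] =
    dfList.flatMap { case (name, items) =>
      items.split(",").toList.map(item => (name, item))
    }
}
```

The two input rows become ten output rows, starting with (Hadoop,Java) and ending with (Spark,GraphX).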
Spark2 Dataset DataFrame: detecting and handling null and NaN values
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.Dataset
import org.apache.spark.sql.Row
import org.apache.spark.sql.DataFrame
i...