Random forest algorithm demo in Python on Spark

Date: 2022-02-06 02:29:25

Key parameters

The two most important parameters, the ones that most often need tuning to improve accuracy, are numTrees and maxDepth.

  • numTrees (the number of decision trees): increasing the number of trees reduces the variance of the predictions, which gives higher accuracy at test time. Training time grows roughly linearly with numTrees.
  • maxDepth: the maximum possible depth of each decision tree in the forest, the same parameter discussed in the decision tree guide. A deeper tree makes the model more expressive, but it also takes longer to train and is more prone to overfitting. Note, however, that random forests and single decision trees place different demands on this parameter: because a random forest votes or averages over the predictions of many trees, it reduces prediction variance and is less likely to overfit than a single tree. A random forest can therefore use a larger maxDepth than a single decision tree model. 
    Some references even suggest growing every tree in the forest as far as possible without pruning. Either way, it is still advisable to experiment with maxDepth and check whether prediction accuracy improves; a minimal tuning sketch follows this list. 
    Two further parameters, subsamplingRate and featureSubsetStrategy, generally do not need tuning. They can, however, be changed to speed up training, at the possible cost of some predictive accuracy (for the details, see the original English documentation quoted below).
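
As a concrete illustration of that advice, here is a minimal tuning sketch: it loops over a few candidate maxDepth values and compares held-out error. It assumes the trainingData/testData split and the imports from the full example further down; the depth grid [4, 8, 12] and numTrees=10 are arbitrary illustrative choices, not recommendations.

    # Minimal maxDepth tuning sketch (assumes trainingData/testData and the
    # imports from the full example below; the depth grid is illustrative only).
    for depth in [4, 8, 12]:
        m = RandomForest.trainClassifier(trainingData, numClasses=2,
                                         categoricalFeaturesInfo={}, numTrees=10,
                                         featureSubsetStrategy="auto",
                                         impurity='gini', maxDepth=depth, maxBins=32)
        preds = m.predict(testData.map(lambda x: x.features))
        err = (testData.map(lambda lp: lp.label).zip(preds)
               .filter(lambda vp: vp[0] != vp[1]).count() / float(testData.count()))
        print('maxDepth = %d, test error = %g' % (depth, err))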

We include a few guidelines for using random forests by discussing the various parameters. We omit some decision tree parameters since those are covered in the decision tree guide. 
The first two parameters we mention are the most important, and tuning them can often improve performance: 
(1)numTrees: Number of trees in the forest. 
Increasing the number of trees will decrease the variance in predictions, improving the model’s test-time accuracy. 
Training time increases roughly linearly in the number of trees. 
(2)maxDepth: Maximum depth of each tree in the forest. 
Increasing the depth makes the model more expressive and powerful. However, deep trees take longer to train and are also more prone to overfitting. 
In general, it is acceptable to train deeper trees when using random forests than when using a single decision tree. One tree is more likely to overfit than a random forest (because of the variance reduction from averaging multiple trees in the forest). 
The next two parameters generally do not require tuning. However, they can be tuned to speed up training. 
(3)subsamplingRate: This parameter specifies the size of the dataset used for training each tree in the forest, as a fraction of the size of the original dataset. The default (1.0) is recommended, but decreasing this fraction can speed up training. 
(4)featureSubsetStrategy: Number of features to use as candidates for splitting at each tree node. The number is specified as a fraction or function of the total number of features. Decreasing this number will speed up training, but can sometimes impact performance if too low. 
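
Note that the RDD-based pyspark.mllib API used in the example below does not expose subsamplingRate in Python; the DataFrame-based pyspark.ml API exposes both of these parameters directly. A minimal sketch, assuming a DataFrame df with the usual "label" and "features" columns:

    # Minimal sketch with the DataFrame-based API, which exposes both parameters
    # in Python (assumes a DataFrame `df` with "label" and "features" columns).
    from pyspark.ml.classification import RandomForestClassifier

    rf = RandomForestClassifier(numTrees=20, maxDepth=8,
                                subsamplingRate=0.8,           # each tree sees 80% of the data
                                featureSubsetStrategy="sqrt")  # sqrt(#features) candidates per split
    rfModel = rf.fit(df)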

"""
Random Forest Classification Example.
"""
from __future__ import print_function from pyspark import SparkContext
# $example on$
from pyspark.mllib.tree import RandomForest, RandomForestModel
from pyspark.mllib.util import MLUtils
# $example off$ if __name__ == "__main__":
sc = SparkContext(appName="PythonRandomForestClassificationExample")
# $example on$
# Load and parse the data file into an RDD of LabeledPoint.
data = MLUtils.loadLibSVMFile(sc, 'data/mllib/sample_libsvm_data.txt')
# Split the data into training and test sets (30% held out for testing)
(trainingData, testData) = data.randomSplit([0.7, 0.3]) # Train a RandomForest model.
# Empty categoricalFeaturesInfo indicates all features are continuous.
# Note: Use larger numTrees in practice.
# Setting featureSubsetStrategy="auto" lets the algorithm choose.
model = RandomForest.trainClassifier(trainingData, numClasses=2, categoricalFeaturesInfo={},
numTrees=3, featureSubsetStrategy="auto",
impurity='gini', maxDepth=4, maxBins=32) # Evaluate model on test instances and compute test error
predictions = model.predict(testData.map(lambda x: x.features))
labelsAndPredictions = testData.map(lambda lp: lp.label).zip(predictions)
testErr = labelsAndPredictions.filter(lambda (v, p): v != p).count() / float(testData.count())
print('Test Error = ' + str(testErr))
print('Learned classification forest model:')
print(model.toDebugString()) # Save and load model
model.save(sc, "target/tmp/myRandomForestClassificationModel")
sameModel = RandomForestModel.load(sc, "target/tmp/myRandomForestClassificationModel")
# $example off$
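
As a quick usage check (an illustrative addition, not part of the original example), the reloaded model should agree with the original; predict() accepts either a single feature vector or an RDD of vectors.

    # Sanity check: the reloaded model reproduces the original's predictions.
    first = testData.first()
    assert sameModel.predict(first.features) == model.predict(first.features)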

The resulting model looks like this:

TreeEnsembleModel classifier with 3 trees

  Tree 0:
    If (feature 511 <= 0.0)
     If (feature 434 <= 0.0)
      Predict: 0.0
     Else (feature 434 > 0.0)
      Predict: 1.0
    Else (feature 511 > 0.0)
     Predict: 0.0
  Tree 1:
    If (feature 490 <= 31.0)
     Predict: 0.0
    Else (feature 490 > 31.0)
     Predict: 1.0
  Tree 2:
    If (feature 302 <= 0.0)
     If (feature 461 <= 0.0)
      If (feature 208 <= 107.0)
       Predict: 1.0
      Else (feature 208 > 107.0)
       Predict: 0.0
     Else (feature 461 > 0.0)
      Predict: 1.0
    Else (feature 302 > 0.0)
     Predict: 0.0
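
For classification, the forest's prediction is the majority vote of its trees, as described above. Purely as an illustration of that combination step (plain Python, not Spark API code):

    # Illustrative only: how a 3-tree classification forest combines votes.
    from collections import Counter

    def forest_predict(tree_predictions):
        # tree_predictions holds one predicted label per tree, e.g. the
        # outputs of Tree 0, Tree 1 and Tree 2 above.
        return Counter(tree_predictions).most_common(1)[0][0]

    print(forest_predict([0.0, 1.0, 0.0]))  # -> 0.0 (class 0 wins two votes to one)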
