Recommended R packages for processing and machine learning on very large datasets

Time: 2021-06-23 21:35:57

It seems like R is really designed to handle datasets that it can pull entirely into memory. What R packages are recommended for signal processing and machine learning on very large datasets that cannot be pulled into memory?


If R is simply the wrong way to do this, I am open to other robust free suggestions (e.g. scipy if there is some nice way to handle very large datasets)


5 Answers

#1


28  

Have a look at the "Large memory and out-of-memory data" subsection of the high performance computing task view on CRAN. bigmemory and ff are two popular packages. For bigmemory (and the related biganalytics and bigtabulate), the bigmemory website has a few very good presentations, vignettes, and overviews from Jay Emerson. For ff, I recommend reading Adler, Oehlschlägel, and colleagues' excellent slide presentations on the ff website.

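To give a feel for the bigmemory workflow, here is a minimal sketch that parses a large CSV into a file-backed big.matrix and runs a bounded-memory regression on it (the file name big_data.csv and the columns y, x1, x2 are hypothetical; it assumes the bigmemory and biganalytics packages are installed):

```r
library(bigmemory)
library(biganalytics)

# Parse the CSV once into a file-backed big.matrix; later sessions can
# re-attach via the descriptor file without re-reading the CSV.
X <- read.big.matrix("big_data.csv", header = TRUE, type = "double",
                     backingfile = "big_data.bin",
                     descriptorfile = "big_data.desc")

colmean(X)                                      # column means without loading X into RAM
fit <- biglm.big.matrix(y ~ x1 + x2, data = X)  # regression with bounded memory
summary(fit)
```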

Also, consider storing data in a database and reading it in smaller batches for analysis. There are any number of approaches to consider. To get started, consider looking through some of the examples in the biglm package, as well as this presentation from Thomas Lumley.

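The core biglm pattern is to fit the model on a first chunk and then fold in further chunks with update(); here is a minimal sketch, assuming a hypothetical CSV big_data.csv with columns y, x1, x2 and a chunk size of 100,000 rows:

```r
library(biglm)

con <- file("big_data.csv", open = "r")
nms <- strsplit(readLines(con, n = 1), ",")[[1]]           # consume the header line
read_chunk <- function(n = 100000)
  tryCatch(read.table(con, sep = ",", nrows = n, col.names = nms),
           error = function(e) NULL)                       # NULL once the file is exhausted

fit <- biglm(y ~ x1 + x2, data = read_chunk())             # fit on the first chunk
while (!is.null(chunk <- read_chunk()))
  fit <- update(fit, chunk)                                # fold in the remaining chunks
close(con)
summary(fit)
```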

And do investigate the other packages mentioned on the high-performance computing task view and in the other answers. The packages I mention above are simply the ones I happen to have more experience with.


#2


8  

I think the amount of data you can process is limited more by one's programming skills than by anything else. Although a lot of standard functionality is focused on in-memory analysis, cutting your data into chunks already helps a lot. Of course, this takes more time to program than picking up standard R code, but often it is quite possible.


Cutting up data can, for example, be done using read.table or readBin, which support reading only a subset of the data. Alternatively, you can take a look at the high performance computing task view for packages which deliver out-of-memory functionality out of the box. You could also put your data in a database. For spatial raster data, the excellent raster package provides out-of-memory analysis.

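As an illustration of the chunked approach, here is a minimal sketch that streams a large binary file of doubles with readBin and keeps only running statistics in memory (the file name big_data.bin and the chunk size are hypothetical; the file is assumed to contain raw 8-byte doubles, e.g. written earlier with writeBin()):

```r
con <- file("big_data.bin", open = "rb")
total <- 0; n <- 0
repeat {
  x <- readBin(con, what = "double", n = 1e6)   # read at most a million values
  if (length(x) == 0) break                     # end of file reached
  total <- total + sum(x)                       # update running statistics only
  n <- n + length(x)
}
close(con)
total / n                                       # overall mean, computed out of core
```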

#3


8  

For machine learning tasks I can recommend the biglm package, which is used to do "Regression for data too large to fit in memory". For using R with really big data, one can use Hadoop as a backend and then use the rmr package to perform statistical (or other) analysis via MapReduce on a Hadoop cluster.

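Here is a minimal sketch of the rmr MapReduce pattern, assuming the rmr2 package from the RHadoop project (the "local" backend lets you try it without a running Hadoop cluster):

```r
library(rmr2)
rmr.options(backend = "local")            # switch to "hadoop" on a real cluster

ints <- to.dfs(1:1000)                    # push the data out to (H)DFS
squares <- mapreduce(
  input = ints,
  map   = function(k, v) keyval(v, v^2)   # emit (x, x^2) pairs
)
out <- from.dfs(squares)                  # bring the key-value pairs back into R
```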

#4


7  

It all depends on the algorithms you need. If they can be translated into an incremental form (where only a small part of the data is needed at any given moment, e.g. for Naive Bayes you can hold in memory only the model itself and the current observation being processed), then the best suggestion is to perform machine learning incrementally, reading new batches of data from disk.

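As an illustration of that incremental style, here is a hand-rolled sketch of Gaussian Naive Bayes training in which only the per-class running sums (the model) and the current chunk are ever in memory; read_next_chunk() is a hypothetical helper that returns the next data.frame chunk (class label in column 1, numeric features after it) or NULL when the data is exhausted:

```r
stats <- list()
while (!is.null(chunk <- read_next_chunk())) {
  for (cls in as.character(unique(chunk[[1]]))) {
    x <- as.matrix(chunk[chunk[[1]] == cls, -1])     # rows of this class, features only
    s <- stats[[cls]]
    if (is.null(s)) s <- list(n = 0, sum = 0, sumsq = 0)
    stats[[cls]] <- list(n     = s$n + nrow(x),
                         sum   = s$sum + colSums(x),
                         sumsq = s$sumsq + colSums(x^2))
  }
}
# Class priors, per-feature means and variances (i.e. the full model)
# follow directly from n, sum and sumsq for each class.
```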

However, many algorithms, and especially their implementations, really require the whole dataset. If the size of the dataset fits on your disk (and within file system limitations), you can use the mmap package, which allows you to map a file on disk to memory and use it in your program. Note, however, that reads and writes to disk are expensive, and R sometimes likes to move data back and forth frequently. So be careful.

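A minimal mmap sketch, assuming the mmap package and a hypothetical file big.bin containing raw 8-byte doubles:

```r
library(mmap)

m <- mmap("big.bin", mode = real64())   # map the file; nothing is read yet
mean(m[1:1e6])                          # pages are pulled in only as you index them
length(m)                               # number of doubles in the file
munmap(m)                               # release the mapping
```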

If your data can't even be stored on your hard drive, you will need to use distributed machine learning systems. One such R-based system is Revolution R, which is designed to handle really large datasets. Unfortunately, it is not open source and costs quite a lot of money, but you may be able to get a free academic license. As an alternative, you may be interested in the Java-based Apache Mahout - a less elegant but very efficient solution, based on Hadoop and including many important algorithms.


#5


3  

If memory is not sufficient, one solution is to push the data to disk and use distributed computing. I think RHadoop (R + Hadoop) may be one of the solutions for tackling a very large dataset.

