Clustering a very large dataset in R

Time: 2021-09-18 13:45:30

I have a dataset of 70,000 numeric values representing distances ranging from 0 to 50, and I want to cluster these numbers. However, the classical clustering approach would require building a 70,000 × 70,000 distance matrix holding the distance between every pair of values, and that won't fit in memory. Is there a smart way to solve this problem without resorting to stratified sampling? I also tried the bigmemory and big analytics libraries in R, but still can't fit the data into memory.
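
(For scale, a quick back-of-envelope check added here: a dense 70,000 × 70,000 matrix of 8-byte doubles needs roughly 36 GiB, which is why it cannot fit in typical memory.)

# Rough memory estimate for the full distance matrix, in GiB
70000^2 * 8 / 2^30   # about 36.5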

2 Answers

#1


5  

You can use kmeans, which is normally suitable for this amount of data, to compute a fairly large number of centers (1,000, 2,000, ...), and then perform a hierarchical clustering approach on the coordinates of these centers. This way the distance matrix will be much smaller.

## Example
# Data
x <- rbind(matrix(rnorm(70000, sd = 0.3), ncol = 2),
           matrix(rnorm(70000, mean = 1, sd = 0.3), ncol = 2))
colnames(x) <- c("x", "y")

# CAH (hierarchical clustering) directly on the data: does not necessarily work,
# since the full distance matrix is huge
library(FactoMineR)
cah.test <- HCPC(x, graph=FALSE, nb.clust=-1)

# CAH on the k-means centers: works quickly
cl <- kmeans(x, 1000, iter.max=20)
cah <- HCPC(cl$centers, graph=FALSE, nb.clust=-1)
plot.HCPC(cah, choice="tree")
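
If you also need cluster labels for the original points, one way (a sketch added here, not part of the original answer, and assuming HCPC keeps the centers in their input row order) is to map each point through the cluster of its k-means center:

# Hypothetical follow-up: cl$cluster gives each point's center index, and
# cah$data.clust$clust gives each center's final cluster from the CAH step.
point.clusters <- cah$data.clust$clust[cl$cluster]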

#2


13  

70000 is not large. It's not small, but it's also not particularly large... The problem is the limited scalability of matrix-oriented approaches.

But there are plenty of clustering algorithms which do not use matrices and do not need O(n^2) (or even worse, O(n^3)) runtime.
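
For instance (an illustration added here, not from the original answer), the dbscan package in R builds a kd-tree index internally, so it never materializes the n × n distance matrix:

# Sketch: density-based clustering without a distance matrix.
# eps and minPts are placeholder values, not tuned for the real data.
library(dbscan)
distances <- runif(70000, 0, 50)                # stand-in for the real data
res <- dbscan(matrix(distances, ncol = 1), eps = 0.5, minPts = 10)
table(res$cluster)                              # cluster 0 is noise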

You may want to try ELKI, which has great index support (try the R*-tree with SortTileRecursive bulk loading). The index support makes it a lot faster.

If you insist on using R, at least give kmeans and the fastcluster package a try. K-means has runtime complexity O(n*k*i) (where k is the parameter k and i is the number of iterations); fastcluster has an O(n) memory, O(n^2) runtime implementation of single-linkage clustering comparable to the SLINK algorithm in ELKI. (The R "agnes" hierarchical clustering will use O(n^3) runtime and O(n^2) memory.)
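
A minimal sketch of that memory-saving route (assuming the fastcluster package is installed; hclust.vector works on the raw observations, so the full distance matrix is never built):

# Single-linkage clustering in O(n) memory via fastcluster::hclust.vector
library(fastcluster)
distances <- runif(70000, 0, 50)            # stand-in for the real data
hc <- hclust.vector(matrix(distances, ncol = 1), method = "single")
clusters <- cutree(hc, k = 5)               # cut into 5 groups (arbitrary choice)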

Implementation matters. Often, implementations in R aren't the best IMHO, except for core R, which usually at least has competitive numerical precision. But R was built by statisticians, not by data miners. Its focus is on statistical expressiveness, not on scalability. So the authors aren't to blame; it's just the wrong tool for large data.

Oh, and if your data is 1-dimensional, don't use clustering at all. Use kernel density estimation. 1-dimensional data is special: it's ordered. Any good algorithm for breaking 1-dimensional data into intervals should exploit the fact that you can sort the data.
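
A minimal base-R sketch of that idea (added for illustration; it splits the data at the local minima of the density estimate, using stand-in data in place of the asker's real distances):

# Break 1-d data into intervals at the local minima of a kernel density estimate
distances <- runif(70000, 0, 50)                 # stand-in for the real data
d <- density(distances)                          # kernel density estimate
minima <- which(diff(sign(diff(d$y))) > 0) + 1   # indices of local minima of d$y
groups <- cut(distances, breaks = c(-Inf, d$x[minima], Inf))
table(groups)                                    # points per interval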
