After cluster analysis, how do I know which cluster new data belongs to?

Time: 2022-04-15 07:25:13

After finishing a cluster analysis, when I input some new data, how do I know which cluster the data belong to?

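library(RSNNS)   # provides decodeClassLabels() and splitForTrainingAndTest()
data(freeny)
options(digits = 2)

# add the year (integer part of the row names) as a sixth column
year <- as.integer(rownames(freeny))
freeny <- cbind(freeny, year)

# shuffle the rows
freeny <- freeny[sample(nrow(freeny)), ]

freenyValues  <- freeny[, 1:5]
freenyTargets <- decodeClassLabels(freeny[, 6])

# hold out 15% of the rows as a test set
freeny <- splitForTrainingAndTest(freenyValues, freenyTargets, ratio = 0.15)

# k-means with 10 clusters on the training inputs
km <- kmeans(freeny$inputsTrain, 10, iter.max = 100)
kclust <- km$cluster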

1 answer

#1



kmeans returns an object that contains the coordinates of the cluster centers in $centers. You want to find the center to which the new object is closest, measured by the sum of squared differences (that is, the squared Euclidean distance):

v <- freeny$inputsTrain[1, ]   # just an example; any new observation with the same columns works
which.min(sapply(seq_len(nrow(km$centers)), function(x) sum((v - km$centers[x, ])^2)))

In my run the above returns 8, which is the same cluster to which the first row of freeny$inputsTrain was assigned. The exact number will differ between runs, because both the row shuffle and the kmeans initialization are random.
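The one-row lookup above can be wrapped into a small function that handles several new observations at once. This is only a minimal sketch: `assign_cluster` is a hypothetical helper name (it is not part of stats::kmeans), and to keep the example self-contained it uses a plain 30/9 row split of freeny with 3 clusters instead of the RSNNS split and the 10 clusters from the question.

```r
# Hypothetical helper (not part of stats::kmeans): assign each row of
# `newdata` to the nearest centre of a fitted kmeans object.
assign_cluster <- function(km, newdata) {
  newdata <- as.matrix(newdata)
  centers <- km$centers
  # squared Euclidean distance from every new row to every centre
  d2 <- sapply(seq_len(nrow(centers)), function(k) {
    rowSums(sweep(newdata, 2, centers[k, ])^2)
  })
  # sapply() returns a plain vector when newdata has one row; restore the matrix
  d2 <- matrix(d2, nrow = nrow(newdata))
  unname(apply(d2, 1, which.min))
}

set.seed(1)                      # arbitrary seed, only for reproducibility
x     <- as.matrix(freeny)       # assumption: stand-in for the data in the question
train <- x[1:30, ]
new   <- x[31:39, ]

km <- kmeans(train, centers = 3, iter.max = 100, nstart = 10)
new_clusters <- assign_cluster(km, new)
new_clusters
```

Applied to the training rows themselves, the helper should reproduce km$cluster once kmeans has converged, which is a quick sanity check before using it on genuinely new data.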

In an alternative approach, you can first create a clustering and then use a supervised machine learning method to train a model on the cluster labels, which you then use for prediction. However, the quality of the model will depend on how well the clustering really represents the structure of the data and on how much data you have. I have inspected your data with PCA (my favorite tool):

pca <- prcomp( freeny$inputsTrain, scale.= TRUE )
library( pca3d )
pca3d( pca )

My impression is that you have at most 6-7 clear classes to work with:

[Figure: pca3d plot of freeny$inputsTrain]

However, one should run more kmeans diagnostics (elbow plots, etc.) to determine the optimal number of clusters:

wss <- sapply( 1:10, function( x ) { km <- kmeans(freeny$inputsTrain,x,iter.max = 100 ) ; km$tot.withinss } )
plot( 1:10, wss )

[Figure: elbow plot of wss against the number of clusters]

This plot suggests 3-4 classes as the optimum. For a more complex and informative approach, consult clustergrams: http://www.r-statistics.com/2010/06/clustergram-visualization-and-diagnostics-for-cluster-analysis-r-code/
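Returning to the alternative mentioned earlier (cluster first, then train a supervised model on the cluster labels), a minimal sketch could look like the following. The answer does not name a specific model; the k-nearest-neighbour classifier from the class package is my own choice for illustration, and the 30/9 row split, k = 3 neighbours and 3 clusters are assumptions made to keep the example short and self-contained.

```r
library(class)                   # recommended package that ships with R

set.seed(1)                      # arbitrary seed, only for reproducibility
x     <- as.matrix(freeny)       # assumption: stand-in for the data in the question
train <- x[1:30, ]
new   <- x[31:39, ]

# step 1: cluster the training data
km <- kmeans(train, centers = 3, iter.max = 100, nstart = 10)
labels <- factor(km$cluster)     # cluster ids become class labels

# step 2: k-nearest-neighbour classifier trained on those labels,
# used here to predict a cluster for each new row
pred <- knn(train = train, test = new, cl = labels, k = 3)
table(pred)
```

Unlike the nearest-center rule, such a model can follow cluster shapes that are not well described by a single center, but it is only as good as the clustering it was trained on.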
