Bisecting k-means: a Python Implementation

Date: 2022-06-02 00:16:03

The bisecting k-means algorithm:

Algorithm idea:

Start with all the points in a single cluster, then split that cluster in two. After that, repeatedly pick the split that most reduces the clustering cost function (the sum of squared errors, SSE) and divide the chosen cluster into two. Continue until the number of clusters reaches the user-specified k.
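The cost function referred to here, the sum of squared errors (SSE), can be sketched in a few lines; the `sse` helper and the toy points below are illustrative, not part of the original code:

```python
import numpy as np

def sse(points, centroid):
    # sum of squared Euclidean distances from each point to the centroid
    return float(np.sum(np.power(points - centroid, 2)))

pts = np.array([[0.0, 0.0], [2.0, 0.0], [0.0, 2.0]])
c = pts.mean(axis=0)          # centroid of the cluster
print(round(sse(pts, c), 4))  # → 5.3333
```

A split is good precisely when the SSE of the two children plus the SSE of the untouched clusters is lower than any other candidate split.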

Pseudocode:

*************************************************************
Treat all the data points as one cluster
While the number of clusters is less than k:
    For each cluster:
        Run k-means (k = 2) on that cluster
        Compute the total error after the split
    Commit the split that yields the lowest total error
*************************************************************

Python implementation:

from numpy import *
import matplotlib.pyplot as plt

def createCenter(dataSet, k):
    # pick k random samples from the data set as the initial centroids
    n = shape(dataSet)[0]
    d = shape(dataSet)[1]
    centroids = zeros((k, d))
    for i in range(k):
        c = int(random.uniform(0, n - 1))  # random sample index
        centroids[i, :] = dataSet[c, :]
    return centroids

def getDist(vec1, vec2):
    # Euclidean distance between two vectors
    return sqrt(sum(power(vec1 - vec2, 2)))

def kmeans(dataSet, k):
    n = shape(dataSet)[0]
    # column 0: cluster index; column 1: squared distance to the centroid
    clusterAssment = mat(zeros((n, 2)))
    centroids = createCenter(dataSet, k)

    clusterChanged = True
    while clusterChanged:
        clusterChanged = False

        # assign every sample to its nearest centroid
        for i in range(n):
            minDist = inf
            minIndex = -1
            for j in range(k):
                distJI = getDist(dataSet[i, :], centroids[j, :])
                if distJI < minDist:
                    minDist = distJI
                    minIndex = j
            if clusterAssment[i, 0] != minIndex:  # convergence condition: assignments no longer change
                clusterChanged = True
            clusterAssment[i, :] = minIndex, minDist ** 2

        # update the centroids
        for i in range(k):
            ptsInCluster = dataSet[nonzero(clusterAssment[:, 0].A == i)[0]]
            centroids[i, :] = mean(ptsInCluster, axis=0)
    return centroids, clusterAssment

def print_result(dataSet, k, centroids, clusterAssment):
    n, d = dataSet.shape
    if d != 2:
        print("Cannot draw!")
        return 1
    mark = ['or', 'ob', 'og', 'ok', '^r', '+r', 'sr', 'dr', '<r', 'pr']
    if k > len(mark):
        print("Sorry, your k is too large")
        return 1

    # draw the samples, one color/marker per cluster
    for i in range(n):
        markIndex = int(clusterAssment[i, 0])
        plt.plot(dataSet[i, 0], dataSet[i, 1], mark[markIndex])

    # draw the centroids
    mark = ['Dr', 'Db', 'Dg', 'Dk', '^b', '+b', 'sb', 'db', '<b', 'pb']
    for i in range(k):
        plt.plot(centroids[i, 0], centroids[i, 1], mark[i], markersize=12)
    plt.show()

def biKmeans(dataSet, k):
    numSamples = dataSet.shape[0]
    # first column stores which cluster this sample belongs to,
    # second column stores the squared error between this sample and its centroid
    clusterAssment = mat(zeros((numSamples, 2)))

    # step 1: the initial cluster is the whole data set
    centroid = mean(dataSet, axis=0).tolist()[0]
    centList = [centroid]
    for i in range(numSamples):
        clusterAssment[i, 1] = getDist(mat(centroid), dataSet[i, :]) ** 2

    while len(centList) < k:
        # minimum total sum of squared errors over all candidate splits
        minSSE = inf
        numCurrCluster = len(centList)
        # for each cluster
        for i in range(numCurrCluster):
            # step 2: get the samples in cluster i
            pointsInCurrCluster = dataSet[nonzero(clusterAssment[:, 0].A == i)[0], :]

            # step 3: split it into 2 sub-clusters using k-means
            centroids, splitClusterAssment = kmeans(pointsInCurrCluster, 2)

            # step 4: calculate the total sum of squared errors after splitting this cluster
            splitSSE = sum(splitClusterAssment[:, 1])
            notSplitSSE = sum(clusterAssment[nonzero(clusterAssment[:, 0].A != i)[0], 1])
            currSplitSSE = splitSSE + notSplitSSE

            # step 5: keep the split with the minimum total sum of squared errors
            if currSplitSSE < minSSE:
                minSSE = currSplitSSE
                bestCentroidToSplit = i
                bestNewCentroids = centroids.copy()
                bestClusterAssment = splitClusterAssment.copy()

        # step 6: re-label the two sub-clusters (label 1 first, then label 0)
        bestClusterAssment[nonzero(bestClusterAssment[:, 0].A == 1)[0], 0] = numCurrCluster
        bestClusterAssment[nonzero(bestClusterAssment[:, 0].A == 0)[0], 0] = bestCentroidToSplit

        # step 7: replace the centroid of the split cluster and append the new one
        centList[bestCentroidToSplit] = bestNewCentroids[0, :].tolist()
        centList.append(bestNewCentroids[1, :].tolist())

        # step 8: update the index and error of the samples whose cluster has changed
        clusterAssment[nonzero(clusterAssment[:, 0].A == bestCentroidToSplit)[0], :] = bestClusterAssment

        plt.figure()
        print_result(dataSet, len(centList), mat(centList), clusterAssment)

    print('Congratulations, clustering with bi-kmeans is complete!')
    return mat(centList), clusterAssment

Here, biKmeans(dataSet, k) is the main routine of the bisecting algorithm. The overall process is:

1. Initialize the centroid and set up the required data structures

2. Trial-split each cluster in two and keep the best split

3. Update the membership and error of each affected cluster
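One subtle point in step 2/3 is how the two sub-clusters produced by the 2-means split are re-labeled: child label 1 becomes a brand-new cluster index, child label 0 keeps the index of the cluster that was split, and the order of the two assignments matters. A self-contained sketch with a made-up toy assignment matrix (the values here are purely illustrative):

```python
import numpy as np

# Toy result of a 2-means split: column 0 is the child label (0 or 1),
# column 1 is the squared error of each sample
splitAssment = np.mat([[0, 1.0], [1, 2.0], [1, 0.5], [0, 0.3]])
bestCentroidToSplit = 2   # index of the cluster that was split
numCurrCluster = 4        # number of clusters before the split

# Child label 1 becomes a brand-new cluster; child label 0 keeps the old index.
# Remapping label 1 first is essential: if label 0 were remapped to
# bestCentroidToSplit first and bestCentroidToSplit happened to be 1, those
# rows would then be wrongly remapped again by the second assignment.
splitAssment[np.nonzero(splitAssment[:, 0].A == 1)[0], 0] = numCurrCluster
splitAssment[np.nonzero(splitAssment[:, 0].A == 0)[0], 0] = bestCentroidToSplit

print(splitAssment[:, 0].T)   # [[2. 4. 4. 2.]]
```

After the remap, total cluster count has grown by one, and the error column can be copied back into the rows of the parent cluster unchanged.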

Clustering result:

(Figures: clustering results of bisecting k-means)

Advantages of bisecting k-means:

  • It speeds up k-means, because fewer similarity computations are needed
  • It is far less sensitive to initialization, because fewer random starting points are chosen and each step commits the split with the lowest total error

Result of plain k-means, for comparison:

(Figure: clustering result of plain k-means)