Python: a guide to using pyclustering.cluster.kmeans

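pyclustering is a Python library for cluster analysis. This post walks through its kmeans module.

I have recently been using the k-means algorithm in some research, and one idea was to swap out the distance function k-means uses. sklearn does not expose an interface for that, and the version I wrote myself did not perform well. I eventually found the pyclustering library, so I am recording my notes on using it here.

The k-means training procedure itself is as described in the blog post.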

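Packages used
from pyclustering.cluster.center_initializer import kmeans_plusplus_initializer
from pyclustering.cluster.kmeans import kmeans
# distance_metric and type_metric are needed for the metric examples later on
from pyclustering.utils.metric import distance_metric, type_metric
import numpy as np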
1. Initialize the centroids:
initial_centers = kmeans_plusplus_initializer(x, cluster_num).initialize()

Here x is the data and cluster_num is the number of clusters.
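
To give a feel for what k-means++ style initialization does, here is a minimal NumPy sketch of the seeding idea. It is not pyclustering's implementation; the function name kmeans_pp_seed and the seed parameter are my own, and it assumes x contains at least cluster_num distinct points.

```python
import numpy as np

def kmeans_pp_seed(x, cluster_num, seed=0):
    """Pick cluster_num initial centers from x with k-means++ style sampling."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    # First center: a uniformly random data point.
    centers = [x[rng.integers(len(x))]]
    for _ in range(cluster_num - 1):
        # Squared Euclidean distance from each point to its nearest chosen center.
        diff = x[:, None, :] - np.asarray(centers)[None, :, :]
        d2 = (diff ** 2).sum(axis=2).min(axis=1)
        # Sample the next center with probability proportional to d2
        # (assumes d2.sum() > 0, i.e. enough distinct points remain).
        centers.append(x[rng.choice(len(x), p=d2 / d2.sum())])
    return np.asarray(centers)

x = [[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]]
centers = kmeans_pp_seed(x, 2)
```

Points that are already centers have zero probability of being picked again, which is why the chosen centers tend to be spread out.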

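2. Instantiate the kmeans class:
kmeans_instance = kmeans(x, initial_centers, metric=metric)

metric is the distance metric. The default is squared Euclidean distance, which gives the same nearest-center assignments as Euclidean distance; metrics are covered in detail below.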

3. Train:
kmeans_instance.process()
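
As a rough picture of what one training iteration looks like, the sketch below performs a single textbook assignment-and-update step with a pluggable two-point distance. This is my own illustration and may differ from what process() does internally; lloyd_step and the example metric are hypothetical names.

```python
import numpy as np

def lloyd_step(x, centers, metric):
    """One assignment + update step of textbook k-means with a custom metric."""
    x = np.asarray(x, dtype=float)
    centers = np.asarray(centers, dtype=float)
    # Assignment: each point goes to the center with the smallest distance.
    dist = np.array([[metric(p, c) for c in centers] for p in x])
    labels = dist.argmin(axis=1)
    # Update: each center becomes the mean of its points (unchanged if empty).
    new_centers = centers.copy()
    for k in range(len(centers)):
        members = x[labels == k]
        if len(members) > 0:
            new_centers[k] = members.mean(axis=0)
    return labels, new_centers

def manhattan(a, b):
    return np.abs(a - b).sum()

labels, new_centers = lloyd_step(
    [[0, 0], [0, 2], [10, 0], [10, 2]], [[0, 0], [10, 0]], manhattan
)
```

Repeating this step until the centers stop moving gives the usual k-means loop. With a non-Euclidean metric the mean is not necessarily the best center, so convergence behaviour may differ.
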
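3. Get the cluster assignments:
clusters = kmeans_instance.get_clusters()

This groups the training data x from above into clusters, returned as a list of index lists. For example, if points a, b, c have labels 1, 1, 0, the returned index list is [[2], [0, 1]].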
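
To make the return format concrete, this small pure-Python sketch builds the same kind of index list from per-point labels. The helper labels_to_clusters is a hypothetical name of mine, not part of pyclustering.

```python
def labels_to_clusters(labels, cluster_num):
    """Group point indices by label, one index list per cluster."""
    clusters = [[] for _ in range(cluster_num)]
    for index, label in enumerate(labels):
        clusters[label].append(index)
    return clusters

# Points a, b, c (indices 0, 1, 2) with labels 1, 1, 0.
clusters_example = labels_to_clusters([1, 1, 0], 2)
print(clusters_example)  # [[2], [0, 1]]
```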

4. Get the centroids:
cs = kmeans_instance.get_centers()
5. Predict:

For prediction, here are a few approaches to suit different situations.

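  • First, predict directly with the fitted instance
label = kmeans_instance.predict(x)
  • From the clusters obtained earlier
label = np.zeros(len(x), dtype=int)
for i, sub in enumerate(clusters):
    # sub holds the indices of the points in cluster i
    label[sub] = i
  • From the centroids obtained earlier, wrapped in a function here; metric is the distance function
def Clu_predict(x, cs, class_num, metric=distance_metric(type_metric.EUCLIDEAN)):
    # differences[i, j] is the distance from point i to centroid j
    differences = np.zeros((len(x), class_num))
    for index_point in range(len(x)):
        differences[index_point] = [metric(x[index_point], c) for c in cs]
    # Each point gets the label of its nearest centroid
    label = np.argmin(differences, axis=1)
    return label

Note that this is slow, because metric is called once per point-centroid pair inside a Python loop. I recommend writing your own matrix-based computation instead.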
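
As one possible matrix-based version, the sketch below assigns labels with NumPy broadcasting. It assumes Euclidean distance only (using squared distances, which give the same argmin), and predict_euclidean is my own hypothetical name.

```python
import numpy as np

def predict_euclidean(x, cs):
    """Label each row of x with the index of its nearest centroid in cs."""
    x = np.asarray(x, dtype=float)
    cs = np.asarray(cs, dtype=float)
    # Broadcasting: (n, 1, d) - (1, k, d) -> (n, k, d)
    diff = x[:, None, :] - cs[None, :, :]
    # Squared Euclidean distance from every point to every centroid, shape (n, k).
    sq_dist = (diff ** 2).sum(axis=2)
    return sq_dist.argmin(axis=1)

label = predict_euclidean([[0, 0], [1, 1], [9, 9]], [[0, 0], [10, 10]])
```

The intermediate array has n * k * d elements, so for very large data it may be worth processing x in chunks.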

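6. Metrics:
  • Using a metric from the library, with Manhattan distance as an example:
manhattan_metric = distance_metric(type_metric.MANHATTAN)
kmeans_instance = kmeans(x, initial_centers, metric=manhattan_metric)

Just swap out the part after type_metric. The distances the library provides are: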

class type_metric(IntEnum):
    """!
    @brief Enumeration of supported metrics in the module for distance calculation between two points.
    """
    ## Euclidean distance, for more information see function 'euclidean_distance'.
    EUCLIDEAN = 0
    ## Square Euclidean distance, for more information see function 'euclidean_distance_square'.
    EUCLIDEAN_SQUARE = 1
    ## Manhattan distance, for more information see function 'manhattan_distance'.
    MANHATTAN = 2
    ## Chebyshev distance, for more information see function 'chebyshev_distance'.
    CHEBYSHEV = 3
    ## Minkowski distance, for more information see function 'minkowski_distance'.
    MINKOWSKI = 4
    ## Canberra distance, for more information see function 'canberra_distance'.
    CANBERRA = 5
    ## Chi square distance, for more information see function 'chi_square_distance'.
    CHI_SQUARE = 6
    ## Gower distance, for more information see function 'gower_distance'.
    GOWER = 7
    ## User defined function for distance calculation between two points.
    USER_DEFINED = 1000
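  • Using a custom distance, with cosine distance as an example:
def cosine_distance(a, b):
    # Cosine distance = 1 - cosine similarity; assumes neither vector is all zeros
    a_norm = np.linalg.norm(a)
    b_norm = np.linalg.norm(b)
    similarity = np.dot(a, b.T) / (a_norm * b_norm)
    dist = 1. - similarity
    return dist
metric = distance_metric(type_metric.USER_DEFINED, func=cosine_distance)

The distance function only needs to compute the distance between two points.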
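
To illustrate that two-point signature, here are a few hand-written distance functions in plain NumPy. They are my own sketches, not the library's code. In particular, safe_cosine_distance and its choice to return 1.0 when either vector has (near-)zero norm are assumptions I made so that the function never divides by zero.

```python
import numpy as np

def manhattan(a, b):
    """Sum of absolute coordinate differences."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(np.abs(a - b).sum())

def chebyshev(a, b):
    """Largest absolute coordinate difference."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(np.abs(a - b).max())

def safe_cosine_distance(a, b, eps=1e-12):
    """1 - cosine similarity, with a guard for zero-length vectors."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    if denom < eps:
        # Assumption: treat a zero vector as if it were orthogonal to everything.
        return 1.0
    return float(1.0 - np.dot(a, b) / denom)
```

Any function of this shape can be passed as func to distance_metric(type_metric.USER_DEFINED, func=...), in the same way as cosine_distance above.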
