Journal of Frontiers of Computer Science and Technology ›› 2016, Vol. 10 ›› Issue (2): 230-247. DOI: 10.3778/j.issn.1673-9418.1506072

• Artificial Intelligence and Pattern Recognition •

K-medoids Clustering Algorithms with Initial Seeds Optimized by Density Peaks

XIE Juanying+, QU Yanan

  1. School of Computer Science, Shaanxi Normal University, Xi'an 710062, China
  • Online: 2016-02-01  Published: 2016-02-03

K-medoids Clustering Algorithms with Optimized Initial Seeds by Density Peaks

XIE Juanying+, QU Yanan   

  1. School of Computer Science, Shaanxi Normal University, Xi’an 710062, China
  • Online: 2016-02-01  Published: 2016-02-03

Abstract: The fast K-medoids clustering algorithm and the variance-based K-medoids clustering algorithm with optimized initial centers suffer from several deficiencies: the number of clusters must be specified manually, the initial cluster centers may fall within the same cluster, and the initial cluster centers of a dataset may not be fully determined. Inspired by the density peaks clustering algorithm, this paper proposes two K-medoids algorithms that determine the number of clusters adaptively. The algorithms measure the local density ρi of a sample xi by the reciprocal of the sum of the distances from xi to its t nearest neighbors, define a new distance δi for xi, and construct a decision graph of sample distance against sample density. Samples with high local density that lie far apart from each other fall in the upper-right region of the decision graph and are far away from most samples of the dataset. Choosing these samples as initial cluster centers ensures that the initial centers lie in different clusters, and the number of clusters of the dataset is obtained automatically. To further optimize the clustering results, the ratio of intra-cluster distance to inter-cluster distance is adopted as the clustering criterion function. Experiments on UCI datasets and synthetic datasets compare the selected initial cluster centers, the number of iterations, clustering time, Rand index, Jaccard coefficient, Adjusted Rand index, clustering accuracy and other classical clustering validity measures. The results show that the proposed K-medoids algorithms can effectively identify the true number of clusters of a dataset and reasonable initial cluster centers, reduce the number of clustering iterations, shorten the clustering time, improve the clustering accuracy, and are highly robust to noisy data.
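The density and distance definitions above can be made concrete with a short sketch. The Python code below is a minimal illustration, not the authors' implementation: it computes ρi as the reciprocal of the sum of distances to the t nearest neighbors as described, but uses the standard density-peaks definition of δi (distance to the nearest denser point), which may differ from the new δi defined in the paper, and a hypothetical "sharpest drop in γ = ρ·δ" rule to read the upper-right corner of the decision graph automatically. The names local_density, delta_distance and pick_seeds are illustrative only.

```python
import numpy as np
from scipy.spatial.distance import cdist

def local_density(X, t=5):
    """rho_i = 1 / (sum of distances from x_i to its t nearest neighbours)."""
    D = cdist(X, X)                          # full pairwise distance matrix
    knn = np.sort(D, axis=1)[:, 1:t + 1]     # skip column 0 (the zero self-distance)
    return 1.0 / knn.sum(axis=1), D

def delta_distance(rho, D):
    """delta_i = distance from x_i to the nearest point of higher density;
    the densest point gets the largest distance in the dataset
    (standard density-peaks definition, assumed here)."""
    order = np.argsort(-rho)                 # indices sorted by decreasing density
    delta = np.empty_like(rho)
    delta[order[0]] = D[order[0]].max()
    for k in range(1, len(rho)):
        i = order[k]
        delta[i] = D[i, order[:k]].min()     # nearest denser point
    return delta

def pick_seeds(rho, delta):
    """Read the 'upper-right corner' of the decision graph automatically:
    rank points by gamma = rho * delta and cut at the sharpest relative drop.
    This cut-off rule is an illustrative heuristic, not the paper's rule."""
    gamma = rho * delta
    order = np.argsort(-gamma)
    g = gamma[order]
    ratios = g[:-1] / np.maximum(g[1:], 1e-12)
    k = int(np.argmax(ratios[: max(1, len(g) // 2)])) + 1
    return order[:k]                         # indices of the chosen initial seeds

# minimal usage on synthetic 2-D data with three well-separated groups
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(c, 0.3, size=(50, 2)) for c in (0, 3, 6)])
    rho, D = local_density(X, t=5)
    seeds = pick_seeds(rho, delta_distance(rho, D))
    print("estimated number of clusters:", len(seeds))
```

Plotting ρ against δ gives the decision graph described above; in practice the seeds could also be picked visually, and the γ-gap cut merely mimics selecting the points that stand alone in its upper-right corner.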

Key words: clustering, K-medoids algorithm, initial cluster centers, density peaks, criterion function

Abstract: The fast K-medoids and the variance-based K-medoids clustering algorithms require the number of clusters of a dataset to be provided manually, and their initial seeds may fall in the same cluster or may not be fully identified. To overcome these deficiencies, and inspired by the density peaks clustering algorithm, this paper proposes two new K-medoids clustering algorithms. The new algorithms define the local density ρi of a point xi as the reciprocal of the sum of the distances between xi and its t nearest neighbors, define a new distance δi for xi as well, and then plot the decision graph of point distance against local density. The points with higher local density that lie apart from each other in the upper-right corner of the decision graph, far away from the remaining points of the dataset, are chosen as the initial seeds for K-medoids, so that the seeds fall in different clusters and the number of clusters of the dataset is automatically determined by the number of initial seeds. To obtain a better clustering, a new measure function is proposed, defined as the ratio of the intra-cluster distance to the inter-cluster distance. The two proposed K-medoids algorithms are tested on real datasets from the UCI machine learning repository and on synthetic datasets. The clustering results are evaluated in terms of the selected initial seeds, the number of iterations, clustering time, Rand index, Jaccard coefficient, Adjusted Rand index and clustering accuracy. The experimental results demonstrate that the proposed K-medoids clustering algorithms can recognize the true number of clusters of a dataset, find proper initial seeds, reduce the number of clustering iterations and the clustering time, improve the clustering accuracy, and are robust to noise.
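As a rough illustration of the proposed measure function, the sketch below is an assumption-laden reading of the abstract rather than the paper's exact formula: it assigns each sample to its nearest medoid and computes the ratio of the summed intra-cluster distances to the summed pairwise inter-medoid distances, so that smaller values indicate more compact and better separated clusters. The names assign_to_medoids and intra_inter_ratio, and the normalisation of the two sums, are assumptions made for this sketch.

```python
import numpy as np
from scipy.spatial.distance import cdist

def assign_to_medoids(X, medoid_idx):
    """Assign every sample to its nearest medoid (the assignment step of K-medoids)."""
    d = cdist(X, X[medoid_idx])            # n x k distances to the k medoids
    return np.argmin(d, axis=1)            # cluster label of each sample

def intra_inter_ratio(X, labels, medoid_idx):
    """Ratio of intra-cluster distance to inter-cluster distance (smaller is better)."""
    medoids = X[medoid_idx]
    # intra: distance from every sample to the medoid of its own cluster
    intra = sum(cdist(X[labels == c], medoids[c][None]).sum()
                for c in range(len(medoid_idx)))
    # inter: pairwise distances between medoids, each pair counted once
    inter = cdist(medoids, medoids).sum() / 2.0
    return intra / inter
```

Under this reading, a K-medoids iteration would accept a medoid swap only if it lowers the ratio; how the paper normalises or weights the two sums is not specified in the abstract.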

Key words: clustering, K-medoids algorithm, initial seeds, density peak, measure function