Journal of Frontiers of Computer Science and Technology ›› 2015, Vol. 9 ›› Issue (8): 973-984.DOI: 10.3778/j.issn.1673-9418.1409062

Previous Articles     Next Articles

K-medoids Clustering Algorithms with Optimized Initial Seeds by Variance

XIE Juanying+, GAO Rui   

  1. School of Computer Science, Shaanxi Normal University, Xi’an 710062, China
  • Online:2015-08-01 Published:2015-08-06

方差优化初始中心的K-medoids聚类算法

谢娟英+,高  瑞   

  1. 陕西师范大学 计算机科学学院,西安 710062

Abstract: To overcome the deficiencies of fast K-medoids clustering algorithm of its computational load in computing the density of points and its initial seeds may locating in a same cluster, and to overcome the disadvantages of neighborhood-based K-medoids algorithm of its arbitrary in selecting a coefficient to adjust the radius of its neighborhood, this paper proposes two new variance based K-medoids clustering algorithms. These new algorithms respectively choose the mean distance between instances and the standard deviation of a specific instance as the radius of a neighborhood, and select the instances with minimum variance as initial seeds one by one where the distance between initial seeds is at least the radius of the neighborhood, so that the expected number of initial seeds have been got. This paper tests the proposed algorithms on the real datasets from UCI machine learning repository and on the synthetically generated datasets, and compares their performance in terms of many popular criteria for clustering. The experimental results demonstrate that the proposed new K-medoids clustering algorithms can obtain better clustering in short time, and they are scalable to cluster a comparable large scale dataset.

Key words: variance, standard deviation, neighborhood, initial seeds; K-medoids clustering

摘要: 针对快速K-medoids聚类算法存在密度计算复杂耗时和初始聚类中心可能位于同一类簇的缺陷,以及基于邻域的K-medoids算法的邻域半径需要人为给定一个调节系数的主观性缺陷,分别以样本间距离均值和相应样本的标准差为邻域半径,以方差作为样本分布密集程度的度量,选取方差值最小且其间距离不低于邻域半径的样本为K-medoids的初始聚类中心,提出了两种方差优化初始中心的K-medoids算法。在UCI数据集和人工模拟数据集上进行了实验测试,并对各种聚类指标进行了比较,结果表明该算法需要的聚类时间短,得到的聚类结果优,适用于较大规模数据集的聚类。

关键词: 方差, 标准差, 邻域, 初始聚类中心, K-medoids聚类