Adaptive K-means Algorithm Combining Nearest-Neighbor Matrix and Local Density

doi:10.3778/j.issn.1673-9418.2110025

Abstract

Abstract: Traditional K-means clustering is sensitive to the initial cluster centers and to outliers, and existing density-based K-means variants all require a density parameter or threshold to be set. To address both problems, this paper proposes an adaptive K-means clustering algorithm that combines the nearest-neighbor matrix with local density. Inspired by the nearest-neighbor absorption principle and the density peak principle, the algorithm constructs an adjacency matrix from the distance differences between data objects and computes the local density of each object from that matrix, without setting any parameters. By fusing the nearest-neighbor matrix with the local density, it adaptively determines the number and positions of the initial cluster centers and, at the same time, completes the initial assignment of the non-center objects. Experiments on synthetic datasets and on real-world datasets from the UCI machine learning repository, together with comparisons against the traditional K-means algorithm, outlier-based improved K-means algorithms, and density-based improved K-means algorithms, show that the proposed adaptive K-means algorithm is robust to outliers on the synthetic datasets and obtains more accurate clustering results on the UCI datasets.

Key words: adaptive K-means clustering algorithm, density peak principle, nearest-neighbor absorption principle, local density
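
The abstract names the ingredients of the method (a nearest-neighbor matrix, a parameter-free local density, and adaptive selection of the initial centers) but gives no formulas. The following is therefore only an illustrative sketch of the general idea, not the paper's algorithm. Every definition in it is an assumption made for illustration: the adjacency matrix marks each object's single nearest neighbor, the density is `1 / (1 + nearest-neighbor distance)`, objects are "absorbed" into the group of their nearest neighbor, and the densest object of each group is taken as a seed. The function names `nearest_neighbor_matrix`, `local_density`, and `adaptive_seeds` are hypothetical.

```python
import math


def nearest_neighbor_matrix(points):
    """Return (adj, nn_dist); adj[i][j] == 1 iff j is the nearest neighbor of i.

    Assumed definition, chosen for illustration only.
    """
    n = len(points)
    adj = [[0] * n for _ in range(n)]
    nn_dist = [0.0] * n
    for i in range(n):
        j = min((k for k in range(n) if k != i),
                key=lambda k: math.dist(points[i], points[k]))
        adj[i][j] = 1
        nn_dist[i] = math.dist(points[i], points[j])
    return adj, nn_dist


def local_density(nn_dist):
    """Assumed parameter-free density: higher when the nearest neighbor is closer."""
    return [1.0 / (1.0 + d) for d in nn_dist]


def adaptive_seeds(points):
    """Pick seeds and an initial assignment without a user-supplied k."""
    adj, nn_dist = nearest_neighbor_matrix(points)
    rho = local_density(nn_dist)
    n = len(points)
    parent = list(range(n))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # "Absorb" every object into the group of its nearest neighbor.
    for i in range(n):
        for j in range(n):
            if adj[i][j]:
                parent[find(i)] = find(j)

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)

    # One seed per group: the member with the highest assumed density.
    seeds = [max(members, key=lambda i: rho[i]) for members in groups.values()]
    labels = [0] * n
    for label, members in enumerate(groups.values()):
        for i in members:
            labels[i] = label
    return seeds, labels, rho


# Two tight groups of three objects plus one isolated object (index 6).
points = [(0, 0), (0, 1), (1.2, 0), (10, 10), (10, 11), (11.2, 10), (5, 20)]
seeds, labels, rho = adaptive_seeds(points)
```

On this toy input the sketch finds two groups, the isolated object has the lowest density and is never chosen as a seed, and the seeds could then be handed to a standard K-means iteration as initial centers. On real data, linking each object only to its single nearest neighbor tends to produce many small groups, so this should not be read as a substitute for the procedure described in the paper.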

Ailiminuer·Kuerban, XIE Juanying, YAO Ruoxia. Adaptive K-means Algorithm Combining Nearest-Neighbor Matrix and Local Density[J]. Journal of Frontiers of Computer Science and Technology, 2023, 17(2): 355-366.
