Peak Density Clustering Algorithm Combining Natural and Shared Nearest Neighbor

doi:10.3778/j.issn.1673-9418.2006060

Abstract

The clustering by fast search and find of density peaks (DPC) algorithm requires no iteration and few parameters, but it has two shortcomings: the cutoff distance parameter must be selected manually, and the algorithm performs poorly on manifold data sets. To address these problems, an improved density peak clustering algorithm is proposed. The algorithm combines the natural nearest neighbor and shared nearest neighbor algorithms to redefine how the cutoff distance and the local density are calculated. It also incorporates the concept of candidate cluster centers: the algorithm first selects a set of candidate cluster centers, then treats these candidates as a new data set and runs density peak clustering on them again. Finally, each remaining point is assigned to the cluster that contains its corresponding candidate center. The improved algorithm is verified on synthetic data sets and UCI data sets, and compared with the K-means, DBSCAN (density-based algorithm for discovering clusters in large spatial databases with noise) and DPC algorithms. Experimental results show that the proposed algorithm significantly improves clustering performance.

Key words: density peak clustering algorithm, natural nearest neighbor, shared nearest neighbor
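
For orientation, the sketch below illustrates the baseline DPC procedure as it is commonly described, including the hand-picked cutoff distance that the abstract identifies as a weakness. It is not the improved algorithm proposed in the paper, whose definitions of cutoff distance and local density are not given in this abstract. The function names `dpc_quantities` and `dpc_cluster`, the hard-threshold density, the product rule for picking centers, and the toy data are all assumptions made for illustration.

```python
import math


def dpc_quantities(points, d_c):
    """Return (rho, delta, nearest_higher, order) for baseline DPC.

    d_c is the manually chosen cutoff distance (an assumed input here).
    """
    n = len(points)
    dist = [[math.dist(p, q) for q in points] for p in points]
    # Local density: number of other points strictly closer than d_c.
    rho = [sum(1 for j in range(n) if j != i and dist[i][j] < d_c)
           for i in range(n)]
    # Visit points from densest to sparsest; ties are broken by index.
    order = sorted(range(n), key=lambda i: (-rho[i], i))
    delta = [0.0] * n
    nearest_higher = [-1] * n
    for rank, i in enumerate(order):
        if rank == 0:
            # Densest point: by convention, its largest distance to any point.
            delta[i] = max(dist[i])
            continue
        # Nearest point among those that precede i in the density ordering.
        j = min(order[:rank], key=lambda k: dist[i][k])
        delta[i] = dist[i][j]
        nearest_higher[i] = j
    return rho, delta, nearest_higher, order


def dpc_cluster(points, d_c, n_clusters):
    """Pick centers by largest rho * delta, then propagate labels."""
    rho, delta, nearest_higher, order = dpc_quantities(points, d_c)
    gamma = [r * d for r, d in zip(rho, delta)]
    centers = sorted(range(len(points)),
                     key=lambda i: (-gamma[i], i))[:n_clusters]
    labels = [-1] * len(points)
    for cluster_id, c in enumerate(centers):
        labels[c] = cluster_id
    # Each non-center inherits the label of its nearest denser neighbor,
    # which has already been labeled because we go in density order.
    for i in order:
        if labels[i] == -1:
            labels[i] = labels[nearest_higher[i]]
    return labels, centers


# Toy data (assumed): two well-separated groups of five points each.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (0.5, 0.5),
          (10, 10), (10, 11), (11, 10), (11, 11), (10.5, 10.5)]
labels, centers = dpc_cluster(points, d_c=1.0, n_clusters=2)
print(centers, labels)
```

In this sketch the result depends directly on `d_c`: a value that is too small leaves every point with zero density, while a value that is too large merges the two groups into one density plateau. That sensitivity is the motivation the abstract gives for deriving the cutoff distance from neighborhood information instead of setting it by hand.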

BAI Exiang, LUO Ke, LUO Xiao. Peak Density Clustering Algorithm Combining Natural and Shared Nearest Neighbor[J]. Journal of Frontiers of Computer Science and Technology, 2021, 15(5): 931-940.
