Density Peaks Clustering Combining Relative Local Density and Nearest Neighbor Relationship

doi:10.3778/j.issn.1673-9418.2205032

Abstract

The density peaks clustering (DPC) algorithm tends to select incorrect cluster centers on datasets with non-uniform density, and its sample allocation is prone to associated errors, where one misassigned point causes the points that depend on it to be misassigned as well. Both problems degrade clustering quality. To address them, a density peaks clustering algorithm that combines relative local density with nearest neighbor relationships is proposed. First, a sparsity-balancing weight is introduced into the definition of local density to obtain a relative local density. Density peaks are then identified by this relative local density, which avoids wrong peak selection on datasets whose regions differ greatly in sparsity and helps ensure that the centers are chosen correctly. Second, a nearest neighbor allocation strategy combines the nearest-neighbor criterion with a threshold limit, and the threshold condition suppresses associated allocation errors. Third, a distance ratio is defined from the mean intra-cluster distance and used in a corrective allocation strategy that improves clustering accuracy for boundary points. The proposed algorithm is compared with DPC, DPC-MND, FKNN-DPC, DBSCAN, OPTICS, AP, and K-means on 5 synthetic datasets and 5 UCI datasets. The experimental results show that it performs well on adjusted mutual information, adjusted Rand index, and the Fowlkes-Mallows index, and a Friedman test indicates that it has the best performance among the compared algorithms.

Key words: clustering algorithm, density peaks, relative local density, nearest neighbor relationship, allocation strategy
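The abstract does not state the formulas for the relative local density or the threshold-based allocation, so they are not reproduced here. As background only, the following is a minimal sketch of the baseline DPC procedure of Rodriguez and Laio (Science, 2014) that the proposed algorithm modifies. The function names `dpc_decision_values` and `dpc_cluster`, the parameters `d_c` and `n_clusters`, and the choice of the cutoff kernel are illustrative assumptions, not the authors' code.

```python
import math


def dpc_decision_values(points, d_c):
    """Baseline DPC quantities: local density rho, distance delta, nearest denser point."""
    n = len(points)
    dist = [[math.dist(p, q) for q in points] for p in points]
    # Cutoff-kernel density: number of other points strictly closer than d_c.
    rho = [sum(1 for j in range(n) if j != i and dist[i][j] < d_c) for i in range(n)]
    # Decreasing density; ties are broken by index so "denser" is well defined.
    order = sorted(range(n), key=lambda i: (-rho[i], i))
    delta = [0.0] * n
    nearest_higher = [-1] * n
    for rank, i in enumerate(order):
        if rank == 0:
            # Convention for the globally densest point: its largest distance.
            delta[i] = max(dist[i])
            continue
        j = min(order[:rank], key=lambda k: dist[i][k])
        delta[i] = dist[i][j]
        nearest_higher[i] = j
    return rho, delta, nearest_higher, order


def dpc_cluster(points, d_c, n_clusters):
    """Pick the n_clusters points with the largest rho * delta as centers, then
    give every other point the label of its nearest denser point."""
    rho, delta, nearest_higher, order = dpc_decision_values(points, d_c)
    n = len(points)
    centers = sorted(range(n), key=lambda i: -(rho[i] * delta[i]))[:n_clusters]
    labels = [-1] * n
    for label, c in enumerate(centers):
        labels[c] = label
    for i in order:  # denser points are labelled before sparser ones
        if labels[i] != -1:
            continue
        j = nearest_higher[i]
        if j == -1:
            # The densest point was not chosen as a center: use the closest center.
            j = min(centers, key=lambda c: math.dist(points[i], points[c]))
        labels[i] = labels[j]
    return labels


# Two small, well-separated groups (hypothetical data for illustration).
points = [(0, 0), (0, 1), (1, 0), (1, 1), (0.5, 0.5),
          (10, 10), (10, 11), (11, 10), (11, 11), (10.5, 10.5)]
labels = dpc_cluster(points, d_c=0.8, n_clusters=2)
```

Because each point inherits the label of its nearest denser point, a single wrong assignment in this baseline propagates to every point downstream of it. This is the associated-error behavior that the threshold-limited nearest neighbor allocation described in the abstract is meant to suppress.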

WANG Weina, ZHU Yu, REN Yan. Density Peaks Clustering Combining Relative Local Density and Nearest Neighbor Relationship[J]. Journal of Frontiers of Computer Science and Technology, 2023, 17(8): 1879-1892.
