计算机科学与探索 ›› 2023, Vol. 17 ›› Issue (11): 2651-2662.DOI: 10.3778/j.issn.1673-9418.2207017

• 理论·算法 • 上一篇    下一篇

反向近邻构造连通图的聚类算法

龙建武,王强   

  1. 重庆理工大学 计算机科学与工程学院,重庆 400054
  • 出版日期:2023-11-01 发布日期:2023-11-01

Clustering Algorithm for Constructing Connected Graphs with Reverse Nearest Neighbors

LONG Jianwu, WANG Qiang   

  1. School of Computer Science and Engineering, Chongqing University of Technology, Chongqing 400054, China
  • Online:2023-11-01 Published:2023-11-01

摘要: 大数据时代的发展使得聚类算法的应用越来越广泛,但是当前大多数聚类算法对噪声数据比较敏感,并且不能识别非凸形状等复杂结构的数据集。针对该问题,提出一种反向近邻构造连通图的聚类算法。首先,设计一种密度计算方式得到数据点的密度,并构建一种动态的噪声判别器对数据进行去噪,从而削弱噪点对聚类过程的影响;其次,考虑到反向邻居更能体现数据点与周围各点之间的联系,设计一种对去噪后数据构造反向近邻连通图来识别簇内数据结构信息的聚类方法,并利用给定的聚类数合并聚类;最后,对噪点划分聚类时,考虑到仅仅将其划分到距离最近的簇可能导致划分结果不准确,设计一种噪点划分方式,将密度信息考虑到噪点划分聚类中,得到最终的聚类结果。为验证提出方法的有效性,将该方法与其他五种聚类算法的聚类结果进行对比,采用外部评价指标Acc和NMI进行聚类结果的评价。实验结果表明,该算法在非凸形状等复杂结构的含噪数据集上的聚类效果优于对比算法。

关键词: 去噪, 反向邻居, 反向近邻连通图, 聚类

Abstract: The development of big data era makes the application of clustering algorithms more and more widespread, but most current clustering algorithms are sensitive to noisy data and cannot identify datasets with complex structures such as non-convex shapes. To address this problem, a clustering algorithm for constructing connected graph with reverse nearest neighbor is proposed. Firstly, a density calculation is designed to obtain the density of data points, and a dynamic noise discriminator is constructed to denoise the data so as to weaken the effect of noise on the clustering process. Secondly, considering that reverse neighbors better reflect the connection between data points and surrounding points, a clustering method is designed to construct reverse nearest neighbor connectivity graphs for the denoised data to identify data structure information within clusters, and to merge clusters using a given number of clusters. Finally, when classifying the noise points into clusters, considering that only classifying them into the closest clusters may lead to inaccurate classification results, a noise classification method is designed to take the density information into account to obtain the final clustering results. To verify the effectiveness of the proposed method, the clustering results of proposed algorithm are compared with those of five other clustering algorithms, and the external evaluation metrics Acc (cluster accuracy) and NMI (normalized mutual information) are used to evaluate the clustering results. Experimental results show that the clustering results of the proposed algorithm are better than those of comparison algorithms on the noisy datasets with complex structures such as non-convex shapes.

Key words: denoising, reverse neighbors, reverse nearest neighbor connectivity graph, clustering