Journal of Frontiers of Computer Science and Technology

• Science Researches •

Density peak clustering algorithm optimized by weighted shared neighbors

ZHANG Wenjie, XIE Juanying   

  1. School of Computer Science, Shaanxi Normal University, Xi’an 710119, China

加权共享近邻优化的密度峰值聚类算法

张文杰, 谢娟英   

  1. 陕西师范大学 计算机科学学院,西安 710119

Abstract: The DPC (Clustering by fast search and find of Density Peaks) algorithm has two limitations: its local density definition is sensitive to the size of the dataset and to the cutoff distance dc, and its single-step assignment strategy for the remaining points can cause a "domino effect". To address them, we propose a density peak clustering algorithm based on weighted shared neighbors, referred to as WSN-DPC. The algorithm replaces the Euclidean distance with a standard-deviation-weighted distance, which emphasizes the contributions of different features to the distance between points. Shared-neighbor information is then used to define the similarity between points, and the local density and relative distance of each point are defined on this similarity so as to reflect the true distribution of the points as far as possible. Outliers and non-outliers are then assigned in turn using distinct strategies, so that each point is assigned to its correct cluster as far as possible. Experimental results on multiple datasets, together with statistical tests, show that WSN-DPC outperforms DPC and its variants, although the difference between WSN-DPC and its counterparts is not always statistically significant. WSN-DPC therefore addresses the limitations of the DPC algorithm and is a state-of-the-art DPC variant.

Key words: shared neighbor, local density, weighted distance, cluster center, clustering

摘要: 针对密度峰值聚类算法DPC(Clustering by fast search and find of Density Peaks)的样本局部密度受到数据集规模大小和截断距离dc影响,及其一步分配策略会带来样本分配的“多米诺骨牌效应”,提出基于加权共享近邻优化的密度峰值聚类算法WSN-DPC (Density peak clustering algorithm based on weighted shared neighbors optimization)。算法利用基于标准差加权的距离代替欧氏距离,强化样本不同特征对距离的贡献,利用共享近邻信息定义样本相似度,进而定义样本局部密度和相对距离,以尽可能体现样本真实分布信息。同时,采用不同分配策略对离群点和非离群点依次进行分配,使得每个样本能够尽可能地分配到正确类簇。多个数据集的实验测试和统计性检验结果表明,WSN-DPC算法优于DPC及其改进算法,但不是与所有对比算法均有统计意义上的显著不同。因此,提出的WSN-DPC算法有效地解决了DPC算法的缺陷,成为当前最优的密度峰值聚类算法。

关键词: 共享近邻, 局部密度, 加权距离, 类簇中心, 聚类
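
The two building blocks named in the abstract, a standard-deviation-weighted distance and shared-neighbor information, can be illustrated with a minimal Python sketch. The abstract does not give the formulas, so everything here is an assumption made for illustration and not the authors' definition: the weights are taken as each feature's standard deviation divided by the sum of all standard deviations, the neighborhood size k is a free parameter, and the function names (std_weights, weighted_distance, knn_sets, shared_neighbor_count) are hypothetical.

```python
import math
import statistics


def std_weights(points):
    """Per-feature weights proportional to each feature's standard deviation.

    Assumption: w_j = sigma_j / sum(sigma). The abstract does not state
    the weighting formula used by WSN-DPC.
    """
    columns = list(zip(*points))
    sigmas = [statistics.pstdev(col) for col in columns]
    total = sum(sigmas)
    if total == 0:
        # All features are constant: fall back to uniform weights.
        return [1.0 / len(sigmas)] * len(sigmas)
    return [s / total for s in sigmas]


def weighted_distance(x, y, weights):
    """Euclidean distance with each squared difference scaled by its weight."""
    return math.sqrt(sum(w * (a - b) ** 2 for w, a, b in zip(weights, x, y)))


def knn_sets(points, k, weights):
    """Index set of the k nearest neighbors of every point (brute force)."""
    neighbors = []
    for i, p in enumerate(points):
        others = [j for j in range(len(points)) if j != i]
        others.sort(key=lambda j: weighted_distance(p, points[j], weights))
        neighbors.append(set(others[:k]))
    return neighbors


def shared_neighbor_count(i, j, neighbors):
    """Number of neighbors that points i and j have in common."""
    return len(neighbors[i] & neighbors[j])


if __name__ == "__main__":
    # Two small, well-separated groups of 2-D points (made-up data).
    points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1),
              (5.0, 5.0), (5.1, 5.0), (5.0, 5.1)]
    weights = std_weights(points)
    neighbors = knn_sets(points, 2, weights)
    print(shared_neighbor_count(0, 1, neighbors))  # same group
    print(shared_neighbor_count(0, 3, neighbors))  # different groups
```

In this sketch, points from the same group share neighbors while points from different groups share none, which is the intuition behind using shared neighbors as a similarity. How WSN-DPC turns that similarity into local density and relative distance is not specified in the abstract and is not reproduced here.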