Journal of Frontiers of Computer Science and Technology ›› 2020, Vol. 14 ›› Issue (6): 996-1004.DOI: 10.3778/j.issn.1673-9418.1901045

Previous Articles     Next Articles

Incremental Spectral Clustering with Closeness of Shared Nearest Neighbors

ZHAO Mengmeng, WANG Shitong   

  1. 1. School of Digital Media, Jiangnan University, Wuxi, Jiangsu 214122, China
    2. Key Laboratory of Media Design and Software Technology of Jiangsu Province, Jiangnan University, Wuxi, Jiangsu 214122, China
  • Online:2020-06-01 Published:2020-06-04

共享近邻紧密度的增量式谱聚类算法

赵萌萌王士同   

  1. 1. 江南大学 数字媒体学院,江苏 无锡 214122
    2. 江南大学 江苏省媒体设计与软件技术重点实验室,江苏 无锡 214122

Abstract:

The existing spectral clustering algorithm based on the closeness of shared nearest neighbors has been attracting more and more attentions, since it indeed well explores the potential similarity relationship between data points and has strong robustness for data sets that are not completely separated. However, it is still costly in the sense of both running time and memory requirements, which makes its clustering processing ability not very efficient. It has the disadvantages of slow running speed, long running time and invalid algorithm when facing large data sets, and hence becomes impractical for a large-scale data. In order to overcome these drawbacks, its incre-mental version is proposed in this paper. The basic idea is to first decompose the entire data set into its several subsets, and then the proposed spectral clustering algorithm guarantees its promising clustering performance by running on each subset in an incremental way. A lot of experiments on artificial data sets and simulation data sets indicate the effectiveness of the proposed spectral clustering algorithm. At the same time, the algorithm has low time consumption, high clustering accuracy, and can effectively cluster the increasing data sets.

Key words: spectral clustering, shared nearest neighbor, incremental, large-scale data

摘要:

现有的基于共享近邻紧密度的谱聚类算法由于能很好地探索出数据点之间的潜在相似性关系,对未能完全分离的数据集具有健壮性,受到了越来越多的关注。但是,在运行时间和内存需求方面,它要花费的代价仍然十分昂贵,这使得其聚类处理能力不太高效,具有运行速度较慢,运行时间过长,面对大数据集时算法失效等缺点,因此该算法对于大规模数据集来说是不切实际的。为了克服这些缺点,提出了一种它的增量版本。该算法的思想是先将数据集分解为若干子集,然后以增量的方式在每个子集上运行,从而保证其具有良好的聚类性能。通过对人工数据集和仿真数据集进行大量的实验验证了该谱聚类算法的有效性。同时,该算法的时间消耗低,聚类精度高,且能够有效地对不断增加的数据集进行聚类。

关键词: 谱聚类, 共享最近邻, 增量式, 大规模数据