Journal of Frontiers of Computer Science and Technology ›› 2019, Vol. 13 ›› Issue (4): 574-585. DOI: 10.3778/j.issn.1673-9418.1806016

• Data Mining •

Research on Initialization of K-means Type Multi-View Clustering

HONG Min1,2, JIA Caiyan1,2+, WANG Xiaoyang1,2

  1. School of Computer and Information Technology, Beijing Jiaotong University, Beijing 100044, China
  2. Beijing Key Laboratory of Traffic Data Analysis and Mining, Beijing 100044, China
  • Online: 2019-04-01 Published: 2019-04-10

Abstract: In K-means type multi-view clustering algorithms, the final clustering result depends on the initial cluster centers. This paper therefore studies how different initial center selection methods affect K-means type multi-view clustering algorithms, and proposes a sampling-based active initial center selection method, SDPC (sampled-clustering by fast search and find of density peaks). SDPC uniformly samples the dataset, applies DPC (clustering by fast search and find of density peaks) to the sample, and then uses a K-means reiterative strategy, with the aim of improving both the efficiency of initial center selection and the choice of the number of clusters in multi-view clustering. Experiments on multi-view benchmark datasets verify the effect of the different initialization methods and show that: global (kernel) K-means initialization suffers from excessively high time complexity; AFKMC2 (assumption-free K-Markov chain Monte Carlo) initialization is suitable for large-scale data; DPC can actively select both the number of clusters and the initial cluster centers; and SDPC, compared with DPC, not only obtains the number of clusters actively but also achieves a good trade-off between clustering accuracy and efficiency.

Key words: multi-view, cluster initialization, clustering
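
The three stages named in the abstract (uniform sampling, DPC on the sample, K-means reiteration) can be illustrated with a minimal single-view sketch. It is not the authors' implementation, and everything below is an assumption: the function names (`dpc_centers`, `kmeans_refine`, `sdpc_init`), the Gaussian density kernel, the percentile rule for the cutoff distance, the largest-gap rule for picking the number of clusters, and the reading of "K-means reiterative strategy" as a few Lloyd iterations on the full data. How the paper combines several views is not stated in the abstract and is not modeled here.

```python
import numpy as np

def dpc_centers(X, k=None, dc_percentile=2.0, max_k=10):
    """Pick density-peak centers from X (assumed DPC variant, Gaussian kernel)."""
    n = len(X)
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)  # pairwise distances
    # Cutoff distance: a low percentile of all pairwise distances (assumption).
    d_c = np.percentile(d[np.triu_indices(n, k=1)], dc_percentile)
    rho = np.exp(-(d / d_c) ** 2).sum(axis=1) - 1.0  # local density, self excluded
    order = np.argsort(-rho)                          # densest point first
    delta = np.empty(n)
    delta[order[0]] = d[order[0]].max()               # convention for the densest point
    for rank in range(1, n):
        i = order[rank]
        delta[i] = d[i, order[:rank]].min()           # distance to nearest denser point
    gamma = rho * delta                               # high density and high separation
    ranked = np.argsort(-gamma)
    if k is None:
        # Hypothetical rule: cut at the largest ratio gap among the top scores.
        top = gamma[ranked[:max_k + 1]]
        k = int(np.argmax(top[:-1] / (top[1:] + 1e-12))) + 1
    return X[ranked[:k]]

def kmeans_refine(X, centers, n_iter=20):
    """Run plain Lloyd iterations on the full data, starting from given centers."""
    centers = np.asarray(centers, dtype=float).copy()
    for _ in range(n_iter):
        dist = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=-1)
        labels = dist.argmin(axis=1)
        for j in range(len(centers)):
            members = X[labels == j]
            if len(members):                          # keep old center if cluster is empty
                centers[j] = members.mean(axis=0)
    # Recompute labels so they match the returned centers.
    dist = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=-1)
    return centers, dist.argmin(axis=1)

def sdpc_init(X, sample_size=200, k=None, seed=0):
    """Uniform sample -> DPC on the sample -> K-means refinement on all points."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(X), size=min(sample_size, len(X)), replace=False)
    return kmeans_refine(X, dpc_centers(X[idx], k=k))
```

Calling `sdpc_init(X, sample_size=150)` on an `(n, d)` array returns the refined centers and a label per point; passing `k` skips the gap rule. Running DPC only on the sample keeps the quadratic distance matrix small, which is presumably where the efficiency gain over plain DPC comes from.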