计算机科学与探索 ›› 2015, Vol. 9 ›› Issue (10): 1163-1171.DOI: 10.3778/j.issn.1673-9418.1412065

• 高性能计算 • 上一篇    下一篇

基于矩阵计算的并行谱聚类方法

张鲁飞+,郝子宇,陈左宁   

  1. 数学工程与先进计算国家重点实验室,江苏 无锡 214125
  • 出版日期:2015-10-01 发布日期:2015-09-29

Parallel Spectral Clustering Algorithm Based on Matrix Computation

ZHANG Lufei+, HAO Ziyu, CHEN Zuoning   

  1. State Key Laboratory of Mathematical Engineering and Advanced Computing, Wuxi, Jiangsu 214125, China
  • Online:2015-10-01 Published:2015-09-29

摘要: 针对大规模社交网络的聚类研究由来已久,谱聚类方法的可扩展性也一直是研究难点。近年来,基于代数图论发展出来的谱聚类方法,利用了特征值对应的谱结构,降低了计算复杂度且保证了聚类质量,是新的研究热点。但是在图的规模比较大和聚类个数比较多的情况下,中间运算结果会突破单机内存限制,必须将谱聚类方法并行化。为解决上述问题,提出了基于矩阵计算的并行谱聚类方法。首先利用矩阵计算领域中形成的大量的高效算法以及成熟的软件解决了特征值分解问题,将大规模的图进行降维,有效地支持原型系统的快速开发。其次使用稀疏矩阵的分片压缩存储,并用MPI(message passing interface)模型实现了矩阵-向量乘等基本算子,提高了系统的可扩展性及可靠性。最后通过实验表明提出的并行谱聚类方法可有效地解决聚类问题所面临的并发度高和平台复杂的挑战,进而支持挖掘蕴藏在海量数据资源中的有价值信息。

关键词: 矩阵计算, 谱聚类, 特征值分解, 社交网络

Abstract: Finding community structure of large-scale graph is always an important problem and difficult to researchers. Spectral clustering algorithm based on algebraic graph theory is an emerging algorithm which uses spectrum structure to reduce computation complexity and ensure the clustering quality. But it can hardly scale because the eigenvalue decomposition of the Laplacian matrix is memory-consumptive and difficult to explore its parallelism. To overcome the problems, this paper proposes a parallel spectral clustering algorithm based on matrix computation. Firstly, this paper uses state-of-art matrix computation tools to solve the problem of eigenvalue decomposition, and reduces the dimension of large-scale graph to support the rapid development of prototype system. Secondly, this paper develops some basic operators such as matrix-vector multiplication with the blocked compression and storage of sparse matrix and MPI (message passing interface) programming model, and improves the scalability and reliability of the system. Finally, the experimental results show that the proposed parallel spectral clustering algorithm can effectively solve the problems of concurrency and complexity, then mine the valuable information in huge amounts of data resources.

Key words: matrix computation, spectral clustering, eigenvalue decomposition, social network