Journal of Frontiers of Computer Science and Technology ›› 2014, Vol. 8 ›› Issue (4): 406-416.DOI: 10.3778/j.issn.1673-9418.1401024

Previous Articles     Next Articles

Efficient Adaptive Clustering Algorithm for Large Scale Information Network

WU Shiji1,2,3, LI Chuan1,2,3+, TANG Changjie1,2, LI Yangtao1,2,3, ZENG Wei1,2,3, YANG Shangqian1,2,3,YANG Ning1,2   

  1. 1. College of Computer Science, Sichuan University, Chengdu 610065, China
    2. National Key Laboratory of Air Control Automation System Technology, Chengdu 610065, China
    3. State Key Laboratory of Software Engineering, Wuhan University, Wuhan 430072, China
  • Online:2014-04-01 Published:2014-04-03

面向大规模信息网络的高效自适应聚类算法

吴诗极1,2,3,李  川1,2,3+,唐常杰1,2,李洋涛1,2,3,曾  卫1,2,3,杨尚乾1,2,3,杨  宁1,2   

  1. 1. 四川大学 计算机学院,成都 610065
    2. 国家空管自动化系统技术重点实验室,成都 610065
    3. 武汉大学 软件工程国家重点实验室,武汉 430072

Abstract: The time cost of traditional clustering algorithm is too high when using it to large scale information network. To solve this issue, based on the statistical characteristic of information network, this paper proposes a novel “divide and conquer” strategy on information network, which reduces the clustering size and time cost heavily without efficiency loss. The main contribution of this paper is three folds: (1) It proposes the idea that clustering in different layers separately after dividing the whole information network into several layers according to the clustering contribution rank; (2) Based on the rich-club phenomenon and hierarchical community feature which exists in information network, it designs the blueprint of layer dividing method of clustering algorithm; (3) It presents an iteration procedure to merge clusters in different layers. The experimental results show that the proposed algorithm has good clustering effect and can reduce time cost.

Key words: information network, adaptive clustering (AC), information layer

摘要: 为解决传统聚类算法在处理大规模信息网络中时间开销过大的问题,基于大规模信息网络的统计学特性,提出了一种将信息网络拓扑结构进行“分而治之”的思想,有效地减少了聚类问题规模和时间开销,并保持了相当的聚类效果。主要贡献包括:提出按照聚类影响力排名来对整个信息网络进行分层切割,然后分别聚类的思想;按照特定信息网络统计学意义上的结构特性,如信息网络的富人集团特性和分层社区结构特性,设计了一套将信息网络进行层次划分的粗略方案,并通过实验证明了其具有一定的合理性;提出了迭代的层级间聚类融合算法,可以实现不同层次聚类的融合。实验表明,该算法在兼具较好聚类效果的同时,非常明显地减少了运算开销。

关键词: 信息网络, 自适应聚类, 信息层