Journal of Frontiers of Computer Science and Technology ›› 2018, Vol. 12 ›› Issue (10): 1571-1582.DOI: 10.3778/j.issn.1673-9418.1709050

Previous Articles     Next Articles

Efficient GraphX-Based Distributed Structural Graph Clustering Algorithm

SHI Shengle, ZHAO Yuhai+, LI Yuan, YIN Ying, WANG Guoren   

  1. School of Computer Science and Engineering, Northeastern University, Shenyang 110819, China
  • Online:2018-10-01 Published:2018-10-08

一种有效的基于GraphX的分布式结构化图聚类算法

时生乐赵宇海+,李  源,印  莹,王国仁   

  1. 东北大学 计算机科学与工程学院,沈阳 110819

Abstract: Structural graph clustering is a fundamental algorithm in large graph analysis, which is of great value in many real-world applications, such as component detection, biological function discovery and graph visualization. At present, most of the distributed structural graph clustering algorithms are based on MapReduce framework in   Hadoop, however this framework requires a lot of disk I/O overhead and calculates the exact similarities between all adjacent vertices in the graph which increases the computation of the algorithm. To solve the above two problems, this paper proposes two pruning rules, the first to reduce the number of similarity calculation between adjacent vertices and the second to reduce the computation time by calculating the similarity between vertices imprecisely. Then this paper proposes a structural graph clustering algorithm based on GraphX in Spark, called GXDSGC, which saves a lot of disk I/O overhead during operation. Finally, extensive experiments on many real and synthetic datasets show the efficiency and effectiveness of the proposed GXDSGC algorithm. Notably, it performs more than 30 times faster than the compared MapReduce framework algorithm based on Hadoop, which improves the efficiency of the structural graph clustering in graph data analysis.

Key words: Spark, GraphX, distributed computing, graph clustering, community structures

摘要: 结构化图聚类是大图数据分析的主要技术之一,在社区检测、生物功能发现和图可视化等许多实际应用中具有重要意义。目前的分布式结构化图聚类算法大多基于Hadoop的MapReduce框架,但该框架需要精确计算图中所有邻接顶点之间的相似性且需要大量的磁盘I/O开销,极大增加了算法的运行时间。针对以上问题,主要工作和贡献点如下:(1)提出两个削减规则,第一个削减规则用来减少邻接顶点之间相似性计算次数,第二个削减规则通过非精确计算邻接顶点间的相似性来减少计算时间。(2)提出一种基于Spark中GraphX的结构化图聚类算法GXDSGC,该算法在运行期间不需要大量的磁盘I/O开销。(3)通过在大量真实数据集和合成数据集上的实验,证实提出的GXDSGC算法的有效性。GXDSGC算法比基于Hadoop中MapReduce框架的算法快30多倍,能够显著提高结构化图聚类在大图数据分析中的效率。

关键词: Spark, GraphX, 分布式计算, 图聚类, 社区结构