Efficient GraphX-Based Distributed Structural Graph Clustering Algorithm

doi:10.3778/j.issn.1673-9418.1709050

Journal of Frontiers of Computer Science and Technology, 2018, Vol. 12, Issue 10: 1571-1582.

SHI Shengle, ZHAO Yuhai+, LI Yuan, YIN Ying, WANG Guoren

School of Computer Science and Engineering, Northeastern University, Shenyang 110819, China

Online: 2018-10-01 Published: 2018-10-08

Abstract

Structural graph clustering is one of the main techniques for analyzing large graphs and is of great value in many real-world applications, such as community detection, biological function discovery, and graph visualization. Most existing distributed structural graph clustering algorithms are built on the MapReduce framework in Hadoop. This approach computes the exact similarity between every pair of adjacent vertices in the graph and incurs heavy disk I/O overhead, both of which greatly increase the running time. To address these two problems, this paper makes the following contributions. (1) It proposes two pruning rules: the first reduces the number of similarity computations between adjacent vertices, and the second reduces computation time by computing the similarity between adjacent vertices inexactly. (2) It proposes GXDSGC, a structural graph clustering algorithm based on GraphX in Spark, which does not require heavy disk I/O at run time. (3) Extensive experiments on real and synthetic datasets confirm the effectiveness of GXDSGC. It runs more than 30 times faster than the compared algorithm based on the MapReduce framework in Hadoop, which significantly improves the efficiency of structural graph clustering in large graph analysis.

Key words: Spark, GraphX, distributed computing, graph clustering, community structures
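
The abstract does not define the similarity measure or state the two pruning rules, so the following is only an illustrative sketch of the general idea of skipping exact similarity computations. It assumes the SCAN-style structural similarity over closed neighborhoods and uses a simple degree-based upper bound as a stand-in pruning test; neither is confirmed to be what GXDSGC uses. It also runs on a single machine in plain Python rather than on GraphX, and all function names and the threshold `eps` are hypothetical.

```python
import math


def structural_similarity(adj, u, v):
    """Exact SCAN-style similarity (assumed measure, not taken from the paper):
    |N[u] & N[v]| / sqrt(|N[u]| * |N[v]|), where N[x] is x plus its neighbors."""
    nu = adj[u] | {u}
    nv = adj[v] | {v}
    return len(nu & nv) / math.sqrt(len(nu) * len(nv))


def similarity_upper_bound(adj, u, v):
    """Degree-only upper bound on the similarity above.
    The intersection can never be larger than the smaller closed neighborhood."""
    du = len(adj[u]) + 1
    dv = len(adj[v]) + 1
    return min(du, dv) / math.sqrt(du * dv)


def similar_edges(adj, eps):
    """Return the edges whose similarity is at least eps, together with the
    number of exact similarity computations that were actually performed."""
    result = set()
    exact_computations = 0
    for u in adj:
        for v in adj[u]:
            if u >= v:
                continue  # visit each undirected edge once
            # Stand-in pruning test: if even the bound is below eps,
            # the exact similarity cannot reach eps, so skip it.
            if similarity_upper_bound(adj, u, v) < eps:
                continue
            exact_computations += 1
            if structural_similarity(adj, u, v) >= eps:
                result.add((u, v))
    return result, exact_computations
```

On a graph where a low-degree vertex is adjacent to a high-degree hub, the bound alone rules out many edges, so the exact set intersection is computed for only a fraction of them. The inexact computation mentioned in the second rule of the abstract is not modeled here.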

SHI Shengle, ZHAO Yuhai, LI Yuan, YIN Ying, WANG Guoren. Efficient GraphX-Based Distributed Structural Graph Clustering Algorithm[J]. Journal of Frontiers of Computer Science and Technology, 2018, 12(10): 1571-1582.

