Efficient GraphX-Based Distributed Structural Graph Clustering Algorithm

doi:10.3778/j.issn.1673-9418.1709050

Abstract

Abstract: Structural graph clustering is a fundamental technique in large-graph analysis, with great value in many real-world applications such as community detection, biological function discovery, and graph visualization. Most existing distributed structural graph clustering algorithms are built on the MapReduce framework in Hadoop. This framework incurs heavy disk I/O overhead, and these algorithms compute the exact similarity between every pair of adjacent vertices in the graph, which increases the computation cost. To address these two problems, this paper first proposes two pruning rules: the first reduces the number of similarity computations between adjacent vertices, and the second reduces computation time by evaluating the similarity between vertices inexactly. The paper then proposes GXDSGC, a structural graph clustering algorithm based on GraphX in Spark, which avoids much of the disk I/O overhead at run time. Finally, extensive experiments on real and synthetic datasets demonstrate the efficiency and effectiveness of GXDSGC. Notably, it runs more than 30 times faster than the compared Hadoop MapReduce-based algorithm, improving the efficiency of structural graph clustering in graph data analysis.

Keywords: Spark, GraphX, distributed computing, graph clustering, community structures

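Abstract (from the Chinese original): Structural graph clustering is one of the main techniques for large-graph data analysis and is important in many practical applications such as community detection, biological function discovery, and graph visualization. Most current distributed structural graph clustering algorithms are based on the MapReduce framework in Hadoop, but this framework requires exact computation of the similarity between all adjacent vertices in the graph and incurs heavy disk I/O overhead, which greatly increases the running time. To address these problems, the main work and contributions are as follows: (1) Two pruning rules are proposed; the first reduces the number of similarity computations between adjacent vertices, and the second reduces computation time by computing the similarity between adjacent vertices inexactly. (2) GXDSGC, a structural graph clustering algorithm based on GraphX in Spark, is proposed; it does not require heavy disk I/O during execution. (3) Experiments on a large number of real and synthetic datasets confirm the effectiveness of GXDSGC. GXDSGC is more than 30 times faster than the algorithm based on the MapReduce framework in Hadoop and significantly improves the efficiency of structural graph clustering in large-graph data analysis.

Keywords (from the Chinese original): Spark, GraphX, distributed computing, graph clustering, community structure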
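
The abstract names two pruning ideas without defining them. As a rough illustration only, the sketch below assumes the structural similarity commonly used in SCAN-style clustering, |N[u] ∩ N[v]| / sqrt(|N[u]| · |N[v]|) over closed neighborhoods, and a similarity threshold `eps`. The function names, the degree bound, and the early-termination check are illustrative assumptions, not the paper's GXDSGC rules, and the code is plain Python rather than GraphX.

```python
import math


def closed_neighborhoods(edges):
    """Build N[v] (the neighbors of v plus v itself) from an undirected edge list."""
    nbrs = {}
    for u, v in edges:
        nbrs.setdefault(u, {u}).add(v)
        nbrs.setdefault(v, {v}).add(u)
    return nbrs


def exact_similarity(nbrs, u, v):
    """Exact structural similarity (assumed SCAN-style definition)."""
    return len(nbrs[u] & nbrs[v]) / math.sqrt(len(nbrs[u]) * len(nbrs[v]))


def is_similar(nbrs, u, v, eps):
    """Decide whether similarity(u, v) >= eps without always computing it exactly."""
    a, b = nbrs[u], nbrs[v]
    if len(a) > len(b):
        a, b = b, a  # iterate over the smaller neighborhood
    # Smallest number of shared neighbors that meets the threshold.
    # The small tolerance guards against floating-point noise in the product.
    need = math.ceil(eps * math.sqrt(len(a) * len(b)) - 1e-9)
    # Assumed pruning idea 1: the intersection cannot exceed the smaller set,
    # so some pairs can be rejected from the neighborhood sizes alone.
    if need > len(a):
        return False
    # Assumed pruning idea 2: stop counting as soon as the answer is decided.
    common = 0
    remaining = len(a)
    for w in a:
        remaining -= 1
        if w in b:
            common += 1
            if common >= need:
                return True
        if common + remaining < need:
            return False
    return common >= need


# Hypothetical toy graph: two triangles joined by the bridge (3, 4).
edges = [(1, 2), (1, 3), (2, 3), (3, 4), (4, 5), (4, 6), (5, 6)]
nbrs = closed_neighborhoods(edges)
similar_edges = [(u, v) for u, v in edges if is_similar(nbrs, u, v, 0.7)]
```

On this toy graph the bridge edge is the only one that fails the 0.7 threshold, so the two triangles would fall into separate clusters under this assumed definition.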

SHI Shengle, ZHAO Yuhai, LI Yuan, YIN Ying, WANG Guoren. Efficient GraphX-Based Distributed Structural Graph Clustering Algorithm[J]. Journal of Frontiers of Computer Science and Technology, 2018, 12(10): 1571-1582.

时生乐，赵宇海，李源，印莹，王国仁. 一种有效的基于GraphX的分布式结构化图聚类算法[J]. 计算机科学与探索, 2018, 12(10): 1571-1582.

