Journal of Frontiers of Computer Science and Technology ›› 2015, Vol. 9 ›› Issue (9): 1066-1074. DOI: 10.3778/j.issn.1673-9418.1411045

• Database Technology •

Parallel Graph Data Analysis System Based on Spark

WANG Hongxu+, WU Bin, LIU Yang

  1. Beijing Key Lab of Intelligent Telecommunication Software and Multimedia, Beijing University of Posts and Telecommunications, Beijing 100876, China
  • Online: 2015-09-01 Published: 2015-12-11

Abstract: This paper presents a parallel data analysis system built on the Spark cloud computing platform. The system targets large-scale graph data analysis tasks, also supports non-graph data analysis applications, and integrates a set of graph data analysis algorithms together with a set of non-graph data analysis algorithms. The paper describes in detail the system's architecture design, its workflow engine and dynamic component update technology, and the design and implementation of some of its parallel data analysis algorithms. Performance tests on datasets of several scales, together with a performance comparison against a traditional MapReduce platform, show that the system completes computing tasks more efficiently than previous graph data mining systems and can also analyze non-graph data effectively.

Key words: cloud computing, parallel algorithms, graph data analysis, data mining, social network analysis