Journal of Frontiers of Computer Science and Technology ›› 2014, Vol. 8 ›› Issue (7): 790-801.DOI: 10.3778/j.issn.1673-9418.1403027

Previous Articles     Next Articles

Using MapReduce Platform to Achieve Efficient Parallel Mining of Frequent Subgraphs

SUN Heli1, CHEN Qiang2, LIU Wei3, HUANG Jianbin2,4+, ZOU Jianhua1   

  1. 1. School of Electronic and Information Engineering, Xi’an Jiaotong University, Xi’an 710049, China
    2. School of Software, Xidian Univeristy, Xi’an 710071, China
    3. School of Information and Communication, Beijing?University?of Posts and Telecommunications, Beijing 100876, China
    4. State Key Laboratory for Novel Software Technology, Nanjing?University, Nanjing 210023, China
  • Online:2014-07-01 Published:2014-07-02


孙鹤立1,陈  强2,刘  玮3,黄健斌2,4+,邹建华1   

  1. 1. 西安交通大学 电子与信息工程学院,西安 710049
    2. 西安电子科技大学 软件学院,西安 710071
    3. 北京邮电大学 信息与通信工程学院,北京 100876
    4. 南京大学 计算机软件新技术国家重点实验室,南京 210023

Abstract: Frequent subgraph mining is an important problem in data mining domain and has been used widely. This paper proposes an efficient algorithm Cloud-GFSG (cloud-global frequent subgraph), by using MapReduce on Hadoop platform for mining frequent subgraphs. The algorithm is based on the principle of Apriori. It uses the discovered frequent subgraphs whose support is k-1 to generate the candidate frequent subgraphs whose support is k when it generates new subgraphs by extending edge. Meanwhile, it checks whether there exists any subgraph which would be generated and sets the frequent subgraph generation rules to ensure the uniqueness of the frequent subgraphs. Compared with the state-of-the-art algorithms, the proposed algorithm has more general function and can avoid generating replicate graphs while extending a new edge. Therefore, its correctness can be ensured and the efficiency had been improved significantly.

Key words: frequent subgraph mining, MapReduce, Hadoop platform

摘要: 频繁子图挖掘是数据挖掘领域的一个重要问题,并且有着广泛的应用。在Hadoop平台上实现了一种基于MapReduce的高效频繁子图挖掘算法Cloud-GFSG(cloud-global frequent subgraph)。该算法基于Apriori思想,在扩展边生成新的子图时,使用已经挖掘出的[k-1]阶的频繁子图生成[k]阶的频繁子图。同时,检查是否存在待扩展生成的子图,设定生成的频繁子图表示规则,保证了频繁子图信息的唯一性。较同类算法相比,该算法在挖掘频繁子图时更具通用性,并且在扩展边时避免产生大量的复制图,从而使得算法的正确性得以保证,且运行效率显著提高。

关键词: 频繁子图挖掘, MapReduce, Hadoop平台