计算机科学与探索 ›› 2014, Vol. 8 ›› Issue (1): 40-50.DOI: 10.3778/j.issn.1673-9418.1305052

• 数据库技术 • 上一篇    下一篇

BHP:面向BSP模型的负载均衡Hash图数据划分

周  爽1,鲍玉斌1+,王志刚1,冷芳玲1,于  戈1,邓  超2,郭磊涛2   

  1. 1. 东北大学 信息科学与工程学院,沈阳 110819
    2. 中国移动通信研究院 业务支撑研究所,北京 100053
  • 出版日期:2014-01-01 发布日期:2014-01-03

BHP:BSP Model Oriented Hash Graph Data Partition with Load Balancing

ZHOU Shuang1, BAO Yubin1+, WANG Zhigang1, LENG Fangling1, YU Ge1, DENG Chao2, GUO Leitao2   

  1. 1. College of Information Science and Engineering, Northeastern University, Shenyang 110819, China
    2. Division for Business Support, China Mobile Institute, Beijing 100053, China
  • Online:2014-01-01 Published:2014-01-03

摘要: 图数据划分是基于BSP(bulk synchronous parallel)编程模型的大规模图处理系统中一个关键技术问题。传统的图划分技术需要多次迭代,时间复杂度过高,且划分结果不具有图顶点到分区的映射信息,因此这些算法并不适用于BSP模型下的数据划分。提出了一种新的面向BSP模型的负载均衡Hash数据划分算法(balanced Hash partition,BHP)。为了实现各个分区的出边数尽可能均衡,该算法引入了虚拟桶的概念,通过贪婪算法将虚拟桶重组为实际分区,保证了每个实际分区负载均衡,同时数据本地化策略使本分片上的数据尽可能地保留在本节点上,从而减小在数据加载时的数据迁移开销。从三个方面对比了BHP算法和经典Hash算法的性能,结果表明BHP算法能够提高作业的执行效率,减少消息发送的数量,有效解决了经典Hash算法的负载不均衡和分区间交互边过多的问题,当数据量变大时,效果尤为明显。

关键词: BSP模型, 图划分, 分布式系统, 负载均衡, 虚拟桶

Abstract: Graph data partition is a critical technical problem in the large-scale graph processing system based on bulk synchronous parallel (BSP) model. Traditional graph partition technology requires many iterations, the time complexity is too high, and the partition results don’t have the mapping information from vertexes to partitions. So those algorithms are not suitable for partitioning graph data for the BSP model based systems. This paper presents a new Hash partition algorithm that implements load balance, called balanced Hash partition (BHP). In order to achieve the balance of the number of out degrees in each partition as much as possible, the concept of virtual bucket is introduced. Then these virtual buckets are reorganized into desired partitions by using a greedy algorithm. In this way, the algorithm can ensure the load balance of each partition. At the same time, the data localization strategy makes the data on the split locate on the corresponding node as much as possible. So, the overhead of data migration during the data loading can be reduced. This paper does some experiments to compare the performance between BHP and the classic Hash algorithm from multiple aspects. The results show that the BHP algorithm can improve the job executing efficiency, reduce the number of sending messages, and effectively solve the problems of load imbalance and too many interactive edges among partitions.

Key words: bulk synchronous parallel (BSP), graph partition, distributed system, load balance, virtual bucket