Journal of Frontiers of Computer Science and Technology ›› 2014, Vol. 8 ›› Issue (11): 1334-1344. DOI: 10.3778/j.issn.1673-9418.1407036

• Database Technology •

MapReduce Based Heuristic Multi-Join Optimization under Hybrid Storage

WANG Mei+, XING Lulu, SUN Li

  1. School of Computer Science and Technology, Donghua University, Shanghai 201620, China
  • Online: 2014-11-01  Published: 2014-11-04

Abstract: MapReduce has become one of the key technologies for massive data processing. However, limitations of its computing framework lead to poor performance on multi-join queries. To address this problem, this paper proposes a heuristic multi-join optimization method for the MapReduce framework, called MHMO (MapReduce based heuristic multi-join optimization). For a given multi-join query, MHMO first constructs the join graph to determine the join pattern, then recommends an execution algorithm suited to that pattern. In particular, a hybrid join is first divided into a group of simple join patterns, and a cost model is then defined to choose the minimum-cost execution order among the groups. Combined with row-column hybrid storage and the deferred materialization technique of column stores, MHMO significantly improves multi-join performance under MapReduce. Finally, experiments on the data warehouse benchmark dataset TPC-H verify the effectiveness of MHMO.

Key words: MapReduce, row-column hybrid storage, deferred materialization, multi-join optimization
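
The abstract describes building a join graph and using it to determine a query's join pattern. The following is a minimal sketch of that idea, not the paper's algorithm: the abstract does not name the simple patterns, so the labels "chain" and "star" are assumptions, as are the function names `build_join_graph`, `is_connected`, and `classify_join_pattern`. Tables are treated as nodes and each join condition as an undirected edge; anything that is neither a chain nor a star is reported as "hybrid".

```python
from collections import defaultdict


def build_join_graph(join_conditions):
    """Build an undirected graph: tables are nodes, join conditions are edges."""
    graph = defaultdict(set)
    for left, right in join_conditions:
        graph[left].add(right)
        graph[right].add(left)
    return dict(graph)


def is_connected(graph):
    """Return True if every table is reachable from an arbitrary start table."""
    if not graph:
        return False
    start = next(iter(graph))
    seen = {start}
    stack = [start]
    while stack:
        node = stack.pop()
        for nbr in graph[node]:
            if nbr not in seen:
                seen.add(nbr)
                stack.append(nbr)
    return len(seen) == len(graph)


def classify_join_pattern(graph):
    """Classify the join graph as 'chain', 'star', or 'hybrid' (assumed labels)."""
    n = len(graph)
    degrees = [len(nbrs) for nbrs in graph.values()]
    edge_count = sum(degrees) // 2
    # Both simple patterns are trees: connected, with exactly n - 1 edges.
    is_tree = is_connected(graph) and edge_count == n - 1
    if is_tree and max(degrees) <= 2:
        return "chain"  # a path: no table joins more than two others
    if is_tree and max(degrees) == n - 1:
        return "star"   # one central table joins every other table
    return "hybrid"


# Example using TPC-H table names: lineitem joined with three other tables.
star = build_join_graph([
    ("lineitem", "orders"),
    ("lineitem", "part"),
    ("lineitem", "supplier"),
])
print(classify_join_pattern(star))  # star
```

With two or three tables a tree is both a path and a star; this sketch checks the chain condition first, so such queries are labeled "chain". Splitting a hybrid join into groups and ordering them by cost is not sketched here, because the abstract does not specify the cost model.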