计算机科学与探索 ›› 2016, Vol. 10 ›› Issue (9): 1250-1261.DOI: 10.3778/j.issn.1673-9418.1507065

• 数据库技术 • 上一篇    下一篇

HashMap优化及其在列存储数据库查询中的应用

母红芬1,李  征1+,霍卫平2,金正皓2   

  1. 1. 北京化工大学 计算机系,北京 100029
    2. 北京东方国信科技股份有限公司,北京 100102
  • 出版日期:2016-09-01 发布日期:2016-09-05

HashMap Optimization and Its Application in Column-Oriented Database Query

MU Hongfen1, LI Zheng1+, HUO Weiping2, JIN Zhenghao2   

  1. 1. Department of Computer Science, Beijing University of Chemical Technology, Beijing 100029, China
    2. Business-Intelligence of Oriental Nations Corporation, Beijing 100102, China
  • Online:2016-09-01 Published:2016-09-05

摘要: HashMap在基本字典操作中具有常数级别的平均算法时间复杂度,广泛应用于大数据的检索。Block_HashMap(BHMap)基于C++ HashMap,其优化包括三方面:哈希函数选取,冲突解决和关键字匹配。优化核心在于冲突解决时,以链地址法为基础,提出了一种高效利用高速缓存的存储结构Block_List来存储冲突的数据,并且预先缓存哈希值,节省匹配时间。实验证明,在桶数目充足的情况下,BHMap会多消耗少部分内存,但在桶数目有限,数据重复率比较低的情况下,时间性能上相对C++标准模板库中的Map提升10倍以上,比unordered_map快3.5倍以上,且消耗的内存与unordered_map相差不大。在列存储数据库分组和连接查询中,关键字的分桶、解决冲突和匹配操作也都涉及到基于哈希的技术,最终把BHMap应用到列存储数据库的关键查询中。

关键词: 哈希图, 分组, 连接, 缓存感知, 缓存不敏感, 列存储数据库, BHMap

Abstract: HashMap has been widely used to retrieve big data because of its constant level in average performance of dictionary operations. Block_HashMap (BHMap) is based on C++ HashMap, in which three optimizations are introduced: Hash function selection, conflict resolution and keyword matching. Conflict resolution is the core of optimization, where Block_list, a storage structure based on the chain address method, is proposed to use cache efficiently and save matching time by store hashcode. In the situation of limited bucket number and low data repetition rate, experiments show that although it consumes a small amount of memory, BHMap has a 3.5 times of C++ unordered_map and 10 times of Map in terms of query speed. In column-oriented database, group by and join are the most commonly used, in which bucket keywords, resolving conflict and matching keywords are all Hash based. Finally the application of BHMap in the query of column-oriented database is provided.

Key words: HashMap, group by, join, cache-conscious, cache-oblivious, column-oriented database, BHMap