计算机科学与探索 ›› 2019, Vol. 13 ›› Issue (7): 1174-1183.DOI: 10.3778/j.issn.1673-9418.1806031

• 人工智能 • 上一篇    下一篇

基于随机森林的哈希检索算法

花  强,郭欣欣,张  峰,董春茹+   

  1. 河北大学 河北省机器学习与计算智能重点实验室,河北 保定 071002
  • 出版日期:2019-07-01 发布日期:2019-07-08

Hashing Retrieval Method Based on Random Forest

HUA Qiang, GUO Xinxin, ZHANG Feng, DONG Chunru+   

  1. Key Laboratory of Machine Learning and Computational Intelligence of Hebei Province, Hebei University, Baoding, Hebei 071002, China
  • Online:2019-07-01 Published:2019-07-08

摘要: 从海量数据中进行近似数据的检索是数据挖掘领域许多应用的关键。尤其近年来,数据的规模出现爆炸式增长,数据检索需面对海量数据和“维度灾难”的叠加考验,这使得传统最近邻算法效率降低,而近似最近邻算法发挥了越来越重要的作用。其中哈希算法以其在存储空间和计算时间上的优势受到了广泛关注。提出了一种基于随机森林的哈希算法。该算法通过构建随机森林,将原始空间的样本映射为海明空间的二进制哈希码,并在哈希空间上定义了顺序敏感的海明距离,以最大程度保持数据在原空间的近邻关系不变。由于随机森林中不同决策树所使用的特征空间和学习过程是独立的,可以以增量的方式灵活地确定哈希码的长度。此外基于随机森林的哈希编码算法天然适合并行部署,从而可以大大提高算法速度。最后,在MNIST和CIFAR-10数据集对所提算法进行了实验验证,结果表明了算法的有效性和出色性能。

关键词: 近似近邻检索(ANNS), 哈希编码, 随机森林, 顺序敏感的海明距离

Abstract: The retrieval of approximate data from massive data is the key to many applications in data mining. Especially in the recent years, the scale of data has exploded, and data retrieval needs to face the overlapping test of massive data and “dimension disaster”, which makes the efficiency of traditional nearest neighbor algorithm decrease,while the approximate nearest neighbor algorithms are playing an increasingly important role. Among them, the Hashing algorithm has received extensive attention due to its advantages in storage space and computation time. This paper proposes a Hashing algorithm based on random forest. In the method, by constructing a random forest, the sample of the original space is mapped to the binary Hash code of Hamming space, and an order-sensitive Hamming distance is defined to keep the neighbor relationship of the data in the original space to the maximum extent. Since the feature space and learning process used by different decision trees in a random forest are independent, the length of the Hash code can be flexibly determined in an incremental manner. In addition, the proposed algorithm based on random forest is naturally suitable for parallel deployment, which can greatly improve the speed of the algorithm. Finally, the proposed algorithm is experimentally verified in the MNIST and CIFAR-10 data sets. The results demonstrate the effectiveness and great performance of the algorithm.

Key words: approximate nearest neighbor search (ANNS), Hashing code, random forest, order-sensitive Hamming distance