基于随机森林的哈希检索算法

doi:10.3778/j.issn.1673-9418.1806031

计算机科学与探索 ›› 2019, Vol. 13 ›› Issue (7): 1174-1183.DOI: 10.3778/j.issn.1673-9418.1806031

基于随机森林的哈希检索算法

花强，郭欣欣，张峰，董春茹+

河北大学河北省机器学习与计算智能重点实验室，河北保定 071002

出版日期:2019-07-01 发布日期:2019-07-08

Hashing Retrieval Method Based on Random Forest

HUA Qiang, GUO Xinxin, ZHANG Feng, DONG Chunru+

Key Laboratory of Machine Learning and Computational Intelligence of Hebei Province, Hebei University, Baoding, Hebei 071002, China

Online:2019-07-01 Published:2019-07-08

摘要/Abstract

摘要： 从海量数据中进行近似数据的检索是数据挖掘领域许多应用的关键。尤其近年来，数据的规模出现爆炸式增长，数据检索需面对海量数据和“维度灾难”的叠加考验，这使得传统最近邻算法效率降低，而近似最近邻算法发挥了越来越重要的作用。其中哈希算法以其在存储空间和计算时间上的优势受到了广泛关注。提出了一种基于随机森林的哈希算法。该算法通过构建随机森林，将原始空间的样本映射为海明空间的二进制哈希码，并在哈希空间上定义了顺序敏感的海明距离，以最大程度保持数据在原空间的近邻关系不变。由于随机森林中不同决策树所使用的特征空间和学习过程是独立的，可以以增量的方式灵活地确定哈希码的长度。此外基于随机森林的哈希编码算法天然适合并行部署，从而可以大大提高算法速度。最后，在MNIST和CIFAR-10数据集对所提算法进行了实验验证，结果表明了算法的有效性和出色性能。

关键词: 近似近邻检索（ANNS）, 哈希编码, 随机森林, 顺序敏感的海明距离

Abstract: The retrieval of approximate data from massive data is the key to many applications in data mining. Especially in the recent years, the scale of data has exploded, and data retrieval needs to face the overlapping test of massive data and “dimension disaster”, which makes the efficiency of traditional nearest neighbor algorithm decrease,while the approximate nearest neighbor algorithms are playing an increasingly important role. Among them, the Hashing algorithm has received extensive attention due to its advantages in storage space and computation time. This paper proposes a Hashing algorithm based on random forest. In the method, by constructing a random forest, the sample of the original space is mapped to the binary Hash code of Hamming space, and an order-sensitive Hamming distance is defined to keep the neighbor relationship of the data in the original space to the maximum extent. Since the feature space and learning process used by different decision trees in a random forest are independent, the length of the Hash code can be flexibly determined in an incremental manner. In addition, the proposed algorithm based on random forest is naturally suitable for parallel deployment, which can greatly improve the speed of the algorithm. Finally, the proposed algorithm is experimentally verified in the MNIST and CIFAR-10 data sets. The results demonstrate the effectiveness and great performance of the algorithm.

Key words: approximate nearest neighbor search (ANNS), Hashing code, random forest, order-sensitive Hamming distance

花强，郭欣欣，张峰，董春茹. 基于随机森林的哈希检索算法[J]. 计算机科学与探索, 2019, 13(7): 1174-1183.

HUA Qiang, GUO Xinxin, ZHANG Feng, DONG Chunru. Hashing Retrieval Method Based on Random Forest[J]. Journal of Frontiers of Computer Science and Technology, 2019, 13(7): 1174-1183.

[1]	武炜杰，张景祥. 有新类的动态数据流分类算法研究[J]. 计算机科学与探索, 2021, 15(1): 132-140.
[2]	尹儒，门昌骞，王文剑. 一种模型决策森林算法[J]. 计算机科学与探索, 2020, 14(1): 108-116.
[3]	周书仁，谢盈，蔡碧野. 融合多尺度特征的深度哈希图像检索方法[J]. 计算机科学与探索, 2018, 12(12): 1974-1986.
[4]	许欧阳，李光辉. 萤火虫优化和随机森林的WSN异常数据检测[J]. 计算机科学与探索, 2018, 12(10): 1633-1644.
[5]	刘弘，江爱文，王明文，万剑怡. 面向近邻搜索的马尔科夫图哈希算法[J]. 计算机科学与探索, 2015, 9(7): 861-868.
[6]	郭颖婕, 刘晓燕, 郭茂祖, 邹权. 植物抗性基因识别中的随机森林分类方法[J]. 计算机科学与探索, 2012, 6(1): 67-77.

基于随机森林的哈希检索算法

Hashing Retrieval Method Based on Random Forest

PDF

可视化

摘要/Abstract

引用本文

使用本文

参考文献

相关文章 6

编辑推荐

Metrics