Object Feature Based Deep Hashing for Cross-Modal Retrieval

doi:10.3778/j.issn.1673-9418.2006062

Abstract

With the rapid growth of multimodal data on the Internet, cross-modal retrieval has become a hot research topic. Owing to their efficiency and effectiveness, hashing-based methods have become one of the most popular strategies for large-scale cross-modal retrieval. Most image-text cross-modal retrieval methods aim to make the deep features of an image similar to the deep features of its corresponding text. However, these methods incorporate the background information of the images into feature learning, which degrades retrieval performance. To solve this problem, object feature based deep hashing (OFBDH) is proposed. It learns optimal, discriminative maximum activations of convolutions from the feature maps to represent the object features, and then integrates the learned object features into the learning of the image-text cross-modal network. Experimental results show that OFBDH obtains satisfactory cross-modal retrieval results on the MIRFLICKR-25K, IAPR TC-12 and NUS-WIDE datasets.

Key words: object feature, cross-modal loss, network parameter learning, retrieval
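
The two ideas named in the abstract, taking the maximum activation of each convolutional feature map and comparing modalities through binary hash codes, can be illustrated with a minimal sketch. This is not the OFBDH implementation: the function names, the hand-picked projection weights and the toy text code are all assumptions made for illustration, whereas in the paper the corresponding quantities are learned by the image and text networks.

```python
from typing import List, Sequence

# One channel of a convolutional feature map: H rows of W activations.
FeatureMap = List[List[float]]


def max_activations(feature_maps: Sequence[FeatureMap]) -> List[float]:
    """Return one value per channel: the largest activation over all positions."""
    return [max(max(row) for row in channel) for channel in feature_maps]


def to_hash_code(features: Sequence[float],
                 projection: Sequence[Sequence[float]]) -> List[int]:
    """Project the features (one weight row per bit) and binarize by sign."""
    code = []
    for weights in projection:
        score = sum(w * f for w, f in zip(weights, features))
        code.append(1 if score >= 0 else -1)
    return code


def hamming_distance(a: Sequence[int], b: Sequence[int]) -> int:
    """Count the positions where two equal-length codes differ."""
    return sum(1 for x, y in zip(a, b) if x != y)


# Toy input: 2 channels, each a 2x3 feature map (hypothetical values).
feature_maps = [
    [[0.1, 0.9, 0.2], [0.0, 0.3, 0.4]],
    [[0.5, 0.1, 0.0], [0.2, 0.7, 0.6]],
]
image_features = max_activations(feature_maps)  # [0.9, 0.7]

# Hand-picked projection standing in for a learned hash layer (assumption).
projection = [[1.0, -1.0], [-1.0, 1.0], [1.0, 1.0], [-1.0, -1.0]]
image_code = to_hash_code(image_features, projection)  # [1, -1, 1, -1]

# Hypothetical hash code of a text; a real system would produce it with a text network.
text_code = [1, -1, -1, -1]
print(hamming_distance(image_code, text_code))  # 1
```

Taking the per-channel maximum keeps only the strongest response in each feature map, which is why such features tend to emphasize salient objects over background; retrieval then reduces to ranking items of the other modality by Hamming distance.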

ZHU Jie, BAI Hongyu, ZHANG Zhongyu, XIE Bojun, ZHANG Junsan. Object Feature Based Deep Hashing for Cross-Modal Retrieval[J]. Journal of Frontiers of Computer Science and Technology, 2021, 15(5): 922-930.
