Survey of Research on Deep Learning Image-Text Cross-Modal Retrieval

doi:10.3778/j.issn.1673-9418.2107076

Journal of Frontiers of Computer Science and Technology, 2022, Vol. 16, Issue (3): 489-511. DOI: 10.3778/j.issn.1673-9418.2107076

Corresponding author: + E-mail: liuying_ciip@163.com

Survey of Research on Deep Learning Image-Text Cross-Modal Retrieval

LIU Ying¹,²,³,⁺, GUO Yingying¹, FANG Jie¹,²,³, FAN Jiulun¹,³, HAO Yu¹,³, LIU Jiming⁴

1. Center for Image and Information Processing, Xi’an University of Posts and Telecommunications, Xi’an 710121, China
2. International Joint Research Center for Wireless Communication and Information Processing Technology of Shaanxi Province, Xi’an 710121, China
3. Key Laboratory of Electronic Information Application Technology for Crime Scene Investigation, Ministry of Public Security, Xi’an University of Posts and Telecommunications, Xi’an 710121, China
4. School of Communications and Information Engineering, Xi’an University of Posts and Telecommunications, Xi’an 710121, China

Received: 2021-07-21 Revised: 2021-09-23 Online: 2022-03-01 Published: 2021-09-23
About the authors: LIU Ying, born in 1972, Ph.D., professor. Her research interests include image retrieval and image enhancement.
GUO Yingying, born in 1995, M.S. candidate. Her research interest is image-text cross-modal retrieval.
FANG Jie, born in 1993, Ph.D., associate professor. His research interests include semantic understanding of visual images and its applications.
FAN Jiulun, born in 1964, Ph.D., professor. His research interests include pattern recognition and image processing.
HAO Yu, born in 1986, Ph.D., lecturer. His research interest is intelligent video processing.
LIU Jiming, born in 1964, Ph.D., distinguished professor at Xi’an University of Posts and Telecommunications. His research interests include artificial intelligence technology and its industrialization.
Supported by:
National Natural Science Foundation of China (62071378)

Abstract:

With the rapid development of deep neural networks, multimodal learning has attracted wide attention. Cross-modal retrieval is an important branch of multimodal learning. Its purpose is to uncover the relationships between samples of different modalities, that is, to use a sample of one modality to retrieve samples of another modality with similar semantics. In recent years, cross-modal retrieval has become a frontier and hot topic of academic research, and it is an important direction for the future development of information retrieval. This paper first focuses on the latest progress in deep learning based image-text cross-modal retrieval and reviews in detail the development of real-valued representation learning methods and binary representation learning methods. Real-valued representation methods aim to improve cross-modal semantic relevance and thereby retrieval accuracy, while binary representation learning methods aim to improve the efficiency of image-text cross-modal retrieval and reduce storage space. Second, the commonly used public datasets in the field are summarized, and the performance of different algorithms on different datasets is compared. In addition, the paper summarizes and analyzes specific applications of image-text cross-modal retrieval in public security, media, and medicine. Finally, in light of existing techniques, development trends and future research directions are discussed.

Key words: cross-modal retrieval, deep learning, feature learning, image-text matching, real-valued representation, binary representation

CLC number: TP391

LIU Ying, GUO Yingying, FANG Jie, FAN Jiulun, HAO Yu, LIU Jiming. Survey of Research on Deep Learning Image-Text Cross-Modal Retrieval[J]. Journal of Frontiers of Computer Science and Technology, 2022, 16(3): 489-511.

References

[1]	MCGURK H, MACDONALD J. Hearing lips and seeing voices[J]. Nature, 1976, 264(5588): 746-748.
[2]	ATREY P K, HOSSAIN M A, EL SADDIK A, et al. Multi-modal fusion for multimedia analysis: a survey[J]. Multimedia Systems, 2010, 16(6): 345-379.
[3]	JIANG X, WU F, ZHANG Y, et al. The classification of multi-modal data with hidden conditional random field[J]. Pattern Recognition Letters, 2015, 51: 63-69.
[4]	WANG D X, CUI P, OU M D, et al. Deep multimodal Hashing with orthogonal regularization[C]// Proceedings of the 24th International Joint Conference on Artificial Intelligence, Buenos Aires, Jul 25-31, 2015. Menlo Park: AAAI, 2015: 2291-2297.
[5]	BALTRUSAITIS T, AHUJA C, MORENCY L P. Multimodal machine learning: a survey and taxonomy[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2019, 41(2): 423-443.
[6]	RAMACHANDRAM D, TAYLOR G W. Deep multimodal learning: a survey on recent advances and trends[J]. IEEE Signal Processing Magazine, 2017, 34(6): 96-108.
[7]	OU W H, LIU B, ZHOU Y H, et al. Survey on the cross-modal retrieval research[J]. Journal of Guizhou Normal University (Natural Sciences), 2018, 36(2): 114-120.
[8]	WANG J, HE Y H, KANG C C, et al. Image-text cross-modal retrieval via modality-specific feature learning[C]// Proceedings of the 5th ACM on International Conference on Multimedia Retrieval, Shanghai, Jun 23-26, 2015. New York: ACM, 2015: 347-354.
[9]	OTTO C, SPRINGSTEIN M, ANAND A, et al. Characterization and classification of semantic image-text relations[J]. International Journal of Multimedia Information Retrieval, 2020, 9(1): 31-45.
[10]	HOTELLING H. Relations between two sets of variates[J]. Biometrika, 1936, 28(3): 321-377.
[11]	LI D G, DIMITROVA N, LI M K, et al. Multimedia content processing through cross-modal association[C]// Proceedings of the 11th ACM International Conference on Multimedia, Berkeley, Nov 2-8, 2003. New York: ACM, 2003: 604-611.
[12]	RASIWASIA N, PEREIRA J C, COVIELLO E, et al. A new approach to cross-modal multimedia retrieval[C]// Proceedings of the 18th International Conference on Multimedia, Firenze, Oct 25-29, 2010. New York: ACM, 2010: 251-260.
[13]	JI Z Y, YAO W N, WEI W, et al. Deep multi-level semantic Hashing for cross-modal retrieval[J]. IEEE Access, 2019, 7: 23667-23674.
[14]	WANG C, YANG H J, MEINEL C. Deep semantic mapping for cross-modal retrieval[C]// Proceedings of the 27th International Conference on Tools with Artificial Intelligence, Vietri sul Mare, Nov 9-11, 2015. Washington: IEEE Computer Society, 2015: 234-241.
[15]	FENG F X, WANG X J, LI R F. Cross-modal retrieval with correspondence autoencoder[C]// Proceedings of the 2014 ACM International Conference on Multimedia, Orlando, Nov 3-7, 2014. New York: ACM, 2014: 7-16.
[16]	FENG F X. Deep learning for cross-modal retrieval[D]. Beijing: Beijing University of Posts and Telecommunications, 2015.
[17]	WANG K, YIN Q, WANG W, et al. A comprehensive survey on cross-modal retrieval[J]. arXiv:1607.06215, 2016.
[18]	PENG Y, HUANG X, ZHAO Y. An overview of cross-media retrieval: concepts, methodologies, benchmarks, and challenges[J]. IEEE Transactions on Circuits and Systems for Video Technology, 2018, 28(9): 2372-2385.
[19]	LI Z Y, HUANG Z F, XU X M. A review of the cross-modal retrieval model and feature extraction based on representation learning[J]. Journal of the China Society for Scientific and Technical Information, 2018, 37(4): 422-435.
[20]	AYYAVARAIAH M, VENKATESWARLU B. Joint graph regularization based semantic analysis for cross-media retrieval: a systematic review[J]. International Journal of Engineering & Technology, 2018, 7: 257-261.
[21]	AYYAVARAIAH M, VENKATESWARLU B. Cross media feature retrieval and optimization: a contemporary review of research scope, challenges and objectives[C]// Proceedings of the 3rd International Conference on Computational Vision and Bio Inspired Computing, Coimbatore, Sep 25-26, 2019. Cham: Springer, 2019: 1125-1136.
[22]	SHAO J. Cross-modal retrieval based on deep learning[D]. Beijing: Beijing University of Posts and Telecommunications, 2017.
[23]	ZHAO T. Research on the method of cross-modal image and text retrieval based on deep learning[D]. Guilin: Guilin University of Electronic Technology, 2019.
[24]	KAUR P, PANNU H S, MALHI A K. Comparative analysis on cross-modal information retrieval: a review[J]. Computer Science Review, 2021, 39(2): 100336.
[25]	HINTON G E, SALAKHUTDINOV R R. Reducing the dimensionality of data with neural networks[J]. Science, 2006, 313(5786): 504-507.
[26]	HINTON G E, OSINDERO S, TEH Y W. A fast learning algorithm for deep belief nets[J]. Neural Computation, 2006, 18(7): 1527-1554.
[27]	SRIVASTAVA N, SALAKHUTDINOV R. Multimodal learning with deep Boltzmann machines[C]// Proceedings of the 26th Annual Conference on Neural Information Processing Systems, Lake Tahoe, Dec 3-6, 2012. Red Hook: Curran Associates, 2012: 2231-2239.
[28]	HERMANN K M, KOCISKÝ T, GREFENSTETTE E, et al. Teaching machines to read and comprehend[C]// Proceedings of the Annual Conference on Neural Information Processing Systems, Montreal, Dec 7-12, 2015. Red Hook: Curran Associates, 2015: 1693-1701.
[29]	MAO J H, XU W, YANG Y, et al. Explain images with multi-modal recurrent neural networks[J]. arXiv:1410.1090, 2014.
[30]	GOODFELLOW I J, POUGET-ABADIE J, MIRZA M, et al. Generative adversarial nets[C]// Proceedings of the 27th International Conference on Neural Information Processing Systems, Montreal, Dec 8-13, 2014. Red Hook: Curran Associates, 2014: 2672-2680.
[31]	LECUN Y, BOTTOU L, BENGIO Y, et al. Gradient-based learning applied to document recognition[J]. Proceedings of the IEEE, 1998, 86(11): 2278-2324.
[32]	GOODFELLOW I, BENGIO Y, COURVILLE A. Deep learning[M]. Cambridge: MIT Press, 2016.
[33]	GU J X, WANG Z H, KUEN J, et al. Recent advances in convolutional neural networks[J]. Pattern Recognition, 2018, 77: 354-377.
[34]	KRIZHEVSKY A, SUTSKEVER I, HINTON G. ImageNet classification with deep convolutional neural networks[J]. Communications of the ACM, 2017, 60(6): 84-90.
[35]	SZEGEDY C, LIU W, JIA Y Q, et al. Going deeper with convolutions[C]// Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition, Boston, Jun 7-12, 2015. Washington: IEEE Computer Society, 2015: 1-9.
[36]	SIMONYAN K, ZISSERMAN A. Very deep convolutional networks for large-scale image recognition[J]. arXiv:1409.1556, 2014.
[37]	XIE S N, GIRSHICK R B, DOLLAR P, et al. Aggregated residual transformations for deep neural networks[C]// Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, Jul 21-26, 2017. Washington: IEEE Computer Society, 2017: 1492-1500.
[38]	HUANG G, LIU Z, VAN DER MAATEN L, et al. Densely connected convolutional networks[C]// Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, Jul 21-26, 2017. Washington: IEEE Computer Society, 2017: 2261-2269.
[39]	JIA Y Q, SHELHAMER E, DONAHUE J, et al. Caffe: convolutional architecture for fast feature embedding[C]// Proceedings of the 22nd ACM International Conference on Multimedia, Orlando, Nov 3-7, 2014. New York: ACM, 2014: 675-678.
[40]	PASZKE A, GROSS S, MASSA F, et al. PyTorch: an imperative style, high-performance deep learning library[C]// Proceedings of the Annual Conference on Neural Information Processing Systems, Vancouver, Dec 8-14, 2019: 8026-8037.
[41]	ABADI M, AGARWAL A, BARHAM P, et al. TensorFlow: large-scale machine learning on heterogeneous distributed systems[J]. arXiv:1603.04467, 2016.
[42]	HOCHREITER S, SCHMIDHUBER J. Long short-term memory[J]. Neural Computation, 1997, 9(8): 1735-1780.
[43]	SCHUSTER M, PALIWAL K K. Bidirectional recurrent neural networks[J]. IEEE Transactions on Signal Processing, 1997, 45(11): 2673-2681.
[44]	KIM Y. Convolutional neural networks for sentence classification[J]. arXiv:1408.5882, 2014.
[45]	GIRSHICK R, DONAHUE J, DARRELL T, et al. Region-based convolutional networks for accurate object detection and segmentation[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2015, 38(1): 142-158.
[46]	HE Y, XIANG S, KANG C, et al. Cross-modal retrieval via deep and bidirectional representation learning[J]. IEEE Transactions on Multimedia, 2016, 18(7): 1363-1377.
[47]	LI Z, LU W, BAO E, et al. Learning a semantic space by deep network for cross-media retrieval[C]// Proceedings of the 21st International Conference on Distributed Multimedia Systems, Vancouver, Aug 31-Sep 2, 2015. Skokie: Knowledge Systems Institute, 2015: 199-203.
[48]	WEI Y, ZHAO Y, LU C, et al. Cross-modal retrieval with CNN visual features: a new baseline[J]. IEEE Transactions on Cybernetics, 2017, 47(2): 449-460.
[49]	FAN M D, WANG W M, DONG P L, et al. Cross-media retrieval by learning rich semantic embeddings of multimedia[C]// Proceedings of the 2017 ACM on Multimedia Conference, Mountain View, Oct 23-27, 2017. New York: ACM, 2017: 1698-1706.
[50]	KARPATHY A, LI F F. Deep visual-semantic alignments for generating image descriptions[C]// Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition, Boston, Jun 8-10, 2015. Washington: IEEE Computer Society, 2015: 3128-3137.
[51]	MALINOWSKI M, ROHRBACH M, FRITZ M. Ask your neurons: a neural-based approach to answering questions about images[C]// Proceedings of the 2015 IEEE International Conference on Computer Vision, Santiago, Dec 7-13, 2015. Washington: IEEE Computer Society, 2015: 1-9.
[52]	GAO D H, JIN L B, CHEN B, et al. FashionBERT: text and image matching with adaptive loss for cross-modal retrieval[C]// Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, Jul 25-30, 2020. New York: ACM, 2020: 2251-2260.
[53]	DEVLIN J, CHANG M W, LEE K, et al. BERT: pre-training of deep bidirectional transformers for language understanding[J]. arXiv:1810.04805, 2018.
[54]	VASWANI A, SHAZEER N, PARMAR N, et al. Attention is all you need[C]// Proceedings of the Annual Conference on Neural Information Processing Systems, Long Beach, Dec 4-9, 2017. Cambridge: MIT Press, 2017: 5998-6008.
[55]	LEE K H, CHEN X, HUA G, et al. Stacked cross attention for image-text matching[C]// LNCS 11208: Proceedings of the 15th European Conference on Computer Vision, Munich, Sep 8-14, 2018. Cham: Springer, 2018: 212-228.
[56]	ZHANG Q, LEI Z, ZHANG Z X, et al. Context-aware attention network for image-text retrieval[C]// Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, Jun 16-20, 2020. Piscataway: IEEE, 2020: 3533-3542.
[57]	CHEN H, DING G G, LIU X D, et al. IMRAM: iterative matching with recurrent attention memory for cross-modal image-text retrieval[C]// Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, Jun 16-20, 2020. Piscataway: IEEE, 2020: 12652-12660.
[58]	NGIAM J, KHOSLA A, KIM M, et al. Multimodal deep learning[C]// Proceedings of the 28th International Conference on Machine Learning, Bellevue, Jun 28-Jul 2, 2011. Madison: Omnipress, 2011: 689-696.
[59]	KIROS R, SALAKHUTDINOV R, ZEMEL R S. Unifying visual-semantic embeddings with multimodal neural language models[J]. arXiv:1411.2539, 2014.
[60]	XU X, SONG J K, LU H M, et al. Modal-adversarial semantic learning network for extendable cross-modal retrieval[C]// Proceedings of the 2018 ACM on International Conference on Multimedia Retrieval, Yokohama, Jun 11-14, 2018. New York: ACM, 2018: 46-54.
[61]	CORNIA M, BARALDI L, TAVAKOLI H R, et al. A unified cycle-consistent neural model for text and image retrieval[J]. Multimedia Tools and Applications, 2020, 79(35): 25697-25721.
[62]	CASTREJÓN L, AYTAR Y, VONDRICK C, et al. Learning aligned cross-modal representations from weakly aligned data[C]// Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, Jun 27-30, 2016. Washington: IEEE Computer Society, 2016: 2940-2949.
[63]	ZHEN L L, HU P, WANG X, et al. Deep supervised cross-modal retrieval[C]// Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, Jun 15-20, 2019. Piscataway: IEEE, 2019: 10394-10403.
[64]	SONG Y, SOLEYMANI M. Polysemous visual-semantic embedding for cross-modal retrieval[C]// Proceedings of the 2019 IEEE Conference on Computer Vision and Pattern Recognition, Long Beach, Jun 15-20, 2019. Piscataway: IEEE, 2019: 1979-1988.
[65]	GU J X, CAI J F, JOTY S R, et al. Look, imagine and match: improving textual-visual cross-modal retrieval with generative models[C]// Proceedings of the 2018 IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, Jun 18-22, 2018. Piscataway: IEEE, 2018: 7181-7189.
[66]	WANG B K, YANG Y, XU X, et al. Adversarial cross-modal retrieval[C]// Proceedings of the 2017 ACM on Multimedia Conference, Mountain View, Oct 23-27, 2017. New York: ACM, 2017: 154-162.
[67]	SALAKHUTDINOV R, HINTON G E. Semantic Hashing[J]. International Journal of Approximate Reasoning, 2009, 50(7): 969-978.
[68]	XIA R K, PAN Y, LAI H J, et al. Supervised Hashing for image retrieval via image representation learning[C]// Proceedings of the 28th AAAI Conference on Artificial Intelligence, Québec, Jul 27-31, 2014. Menlo Park: AAAI, 2014: 2156-2162.
[69]	LIN K, YANG H F, HSIAO J H, et al. Deep learning of binary Hash codes for fast image retrieval[C]// Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition Workshops, Boston, Jun 7-12, 2015. Washington: IEEE Computer Society, 2015: 27-35.
[70]	ZHAO F, HUANG Y Z, WANG L, et al. Deep semantic ranking based Hashing for multi-label image retrieval[C]// Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition, Boston, Jun 8-10, 2015. Washington: IEEE Computer Society, 2015: 1556-1564.
[71]	ZHANG D, WANG F, SI L. Composite Hashing with multiple information sources[C]// Proceedings of the 34th International ACM SIGIR Conference on Research and Development in Information Retrieval, Beijing, Jul 25-29, 2011. New York: ACM, 2011: 225-234.
[72]	CAO Y, LONG M S, WANG J M, et al. Deep visual-semantic Hashing for cross-modal retrieval[C]// Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, Aug 13-17, 2016. New York: ACM, 2016: 1445-1454.
[73]	JIANG Q Y, LI W J. Deep cross-modal Hashing[C]// Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, Jul 21-26, 2017. Washington: IEEE Computer Society, 2017: 3270-3278.
[74]	LI C, DENG C, LI N, et al. Self-supervised adversarial Hashing networks for cross-modal retrieval[C]// Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, Jun 18-22, 2018. Piscataway: IEEE, 2018: 4242-4251.
[75]	WU L, WANG Y, SHAO L. Cycle-consistent deep generative Hashing for cross-modal retrieval[J]. IEEE Transactions on Image Processing, 2019, 28(4): 1602-1612.
[76]	CHUA T S, TANG J H, HONG R C, et al. NUS-WIDE: a real-world web image database from National University of Singapore[C]// Proceedings of the 8th ACM International Conference on Image and Video Retrieval, Santorini Island, Jul 8-10, 2009. New York: ACM, 2009: 1-9.
[77]	LIN T Y, MAIRE M, BELONGIE S J, et al. Microsoft COCO: common objects in context[C]// LNCS 8693: Proceedings of the 13th European Conference on Computer Vision, Zurich, Sep 6-12, 2014. Cham: Springer, 2014: 740-755.
[78]	YOUNG P, LAI A, HODOSH M, et al. From image descriptions to visual denotations: new similarity metrics for semantic inference over event descriptions[J]. Transactions of the Association for Computational Linguistics, 2014, 2: 67-78.
[79]	PEREIRA J C, COVIELLO E, DOYLE G, et al. On the role of correlation and abstraction in cross-modal multimedia retrieval[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2014, 36(3): 521-535.
[80]	HENNING M, CLOUGH P, MÜLLER H, et al. The IAPR benchmark: a new evaluation resource for visual information systems[C]// Proceedings of the 2006 International Conference on Language Resources and Evaluation, Genova, May 24-26, 2006. European Language Resources Association, 2006: 13-23.
[81]	LIU D P. Design and implementation of Internet opinions monitoring and analyzing system[D]. Chengdu: University of Electronic Science and Technology of China, 2011.
[82]	HE Y J. Social network cross-media national security incidents semantic study and behavioral analysis[D]. Beijing: Beijing University of Posts and Telecommunications, 2018.
[83]	SHEN Z Q. Research on public opinion analysis methods based on text and image[D]. Zhenjiang: Jiangsu University, 2018.
[84]	HUANG F R, ZHANG X M, ZHAO Z H, et al. Image-text sentiment analysis via deep multimodal attentive fusion[J]. Knowledge-Based Systems, 2019, 167: 26-37.
[85]	ZENG Q Q, ZHANG T T. Application of image retrieval technology based on big data in investigation[J]. Electronic Technology & Software Engineering, 2018, 143(21): 175-176.
[86]	QIAO F C. Research on the technology of evidence images retrieval using visual and text features[D]. Changsha: National University of Defense Technology, 2013.
[87]	VINYALS O, TOSHEV A, BENGIO S, et al. Show and tell: a neural image caption generator[C]// Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition, Boston, Jun 8-10, 2015. Washington: IEEE Computer Society, 2015: 3156-3164.
[88]	VENUGOPALAN S, XU H, DONAHUE J, et al. Translating videos to natural language using deep recurrent neural networks[J]. arXiv:1412.4729, 2014.
[89]	LAN Z Z, BAO L, YU S I, et al. Multimedia classification and event detection using double fusion[J]. Multimedia Tools & Applications, 2014, 71(1): 333-347.
[90]	CAI G Y, XIA B B. Convolutional neural networks for multimedia sentiment analysis[C]// LNCS 9362: Proceedings of the 4th CCF Conference on Natural Language Processing and Chinese Computing, Nanchang, Oct 9-13, 2015. Cham: Springer, 2015: 159-167.
[91]	LING H B, MIAO Y Q, ZHANG W Z, et al. Multimedia sentiment analysis on microblog based on multi-feature fusion[J]. Application Research of Computers, 2020, 37(7): 1935-1939.
[92]	CHEN X Y, WANG Y H, LIU Q J. Visual and textual sentiment analysis using deep fusion convolutional neural networks[C]// Proceedings of the 2017 IEEE International Conference on Image Processing, Beijing, Sep 17-20, 2017. Piscataway: IEEE, 2017: 1557-1561.
[93]	ZHANG Y Y, LI G R, CHU L Y, et al. Cross-media topic detection: a multi-modality fusion framework[C]// Proceedings of the 2013 IEEE International Conference on Multimedia and Expo, San Jose, Jul 15-19, 2013. Washington: IEEE Computer Society, 2013: 1-6.
[94]	YUAN Z Q, SANG J T, LIU Y, et al. Latent feature learning in social media network[C]// Proceedings of the 21st ACM International Conference on Multimedia, Barcelona, Oct 21-25, 2013. New York: ACM, 2013: 253-262.
[95]	YU Y H, LIN H F, MENG J N, et al. Classification modeling and recognition for cross modal and multi-label biomedical image[J]. Journal of Image and Graphics, 2018, 23(6): 917-927.
[96]	COOKE R E, GAETA M G, KAUFMAN D M, et al. Picture archiving and communication system: US6574629 B1[P]. 2003-06-03.
[97]	ZHAI X. Medical data retrieval technology based on mode network[D]. Harbin: Harbin Engineering University, 2017.
[98]	ZHEN Y, YEUNG D Y. Co-regularized Hashing for multi-modal data[C]// Proceedings of the 26th Annual Conference on Neural Information Processing Systems, Lake Tahoe, Dec 3-6, 2012. Red Hook: Curran Associates, 2012: 1385-1393.

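Category	Representative method	Characteristics	Applicable scenarios
Feature representation based methods	MSDS	Extracts text features more effectively, but performs poorly when the number of samples is small	Large-scale datasets
	DeCAF	The learned visual features have sufficient representational power	Image-text cross-modal retrieval with single-label or multi-label samples
	deep-SM	Improves adaptability to the target dataset and effectively narrows the gap between images and their corresponding semantic concepts, but does not establish a good relationship between the low-level features and high-level semantics of text data	Image-text cross-modal retrieval with single-label or multi-label samples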

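Category	Representative method	Characteristics	Applicable scenarios
Image-text alignment methods	Neural-Image-QA	Mitigates the exploding gradient problem, and the "single word" variant doubles the accuracy obtained and achieves the best performance; however, accuracy drops rapidly for longer answers, and matching tends to fail when there are too few training data points or the image is heavily occluded	Real-world image question answering tasks
	FashionBERT	Can mask out irrelevant information in images, and patches provide non-repeated and reasonably related information, but requires a large number of annotated image-text pairs, which are hard to obtain in practice	Fine-grained cross-modal image-text retrieval tasks
	CAAN	Fast retrieval and a small model, making it more convenient and practical to deploy and apply; however, combining inter-modal alignment with intra-modal correlation requires a more refined model design	Words or image regions that carry multiple meanings in different global contexts
	IMRAM	Achieves the best performance on both small-scale and large-scale datasets and is robust	Cross-modal retrieval tasks that account for semantic complexity
Cross-modal reconstruction methods	SC-NLM	After training, the model can be conditioned on image embeddings, and the scoring function for newly generated descriptions can be continually optimized to rank images and captions; however, the vector used to condition the decoder cannot be modified dynamically	Image caption generation
	MASLN	Effectively narrows the statistical gap between modalities and maximizes semantic discriminability, but the trained model may not reflect well the pairwise correlations of the target set in the common representation	Extendable cross-modal retrieval tasks
	CyTIR-Net	Performs better on small and medium-sized datasets, improves final retrieval performance when training data are scarce, and requires no additional annotation cost, but in some cases cannot retrieve all the corresponding details across modalities	Small and medium-sized datasets
Image-text joint embedding methods	RE-DNN	Requires very little prior knowledge for model training, handles the missing-modality problem, and scales to large datasets	Multimedia information retrieval systems
	PVSE	Overcomes the problem that ignored information cannot be recovered once it is lost at the mapped point	Polysemous instance problems
	GXN	Can retrieve images with local similarity or sentences with word-level similarity	Polysemous instance problems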
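
The joint embedding methods in the table above map images and text into a common space and rank candidates by similarity. The following is a minimal sketch of that ranking step only, assuming the embeddings have already been produced by some image and text encoders. The vectors and the helper names `cosine` and `rank_by_similarity` are hypothetical and are not taken from any of the listed methods; cosine similarity is used here as one common choice, not necessarily the measure each method adopts.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def rank_by_similarity(query, candidates):
    """Return candidate indices sorted from most to least similar to the query."""
    scores = [cosine(query, c) for c in candidates]
    return sorted(range(len(candidates)), key=lambda i: scores[i], reverse=True)

# Hypothetical 3-dimensional embeddings in a shared space (made-up numbers).
text_query = [1.0, 0.0, 0.0]
image_embeddings = [
    [0.0, 1.0, 0.0],  # image 0: orthogonal to the query
    [0.9, 0.1, 0.0],  # image 1: nearly parallel to the query
    [0.5, 0.5, 0.0],  # image 2: in between
]

ranking = rank_by_similarity(text_query, image_embeddings)
print(ranking)
```

Swapping the roles of the two lists gives image-to-text retrieval with the same code, which is why a shared space supports both query directions.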

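Category	Representative method	Characteristics	Applicable scenarios
Binary representation learning methods	DVSH	Effectively overcomes the need of traditional fusion networks for joint embedding of bimodal objects, and is robust to parameter selection	Applications that require high precision and can tolerate fewer top retrieval results
	DCMH	Learns discrete hash codes directly, avoiding a loss in accuracy of the learned hash codes, but requires a long training time	Approximate nearest neighbor (ANN) search on large-scale datasets
	SSAH	Short training time; captures more precise inter-modal correlations	Approximate nearest neighbor (ANN) search on large-scale datasets
	CYC-DGH	Effectively compresses input data while preserving as much as possible of its own information and of the relationships between samples from different modalities	Cross-modal retrieval tasks with a limited number of labeled samples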
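
Binary representation methods trade some accuracy for compact storage and fast comparison. The sketch below illustrates only the retrieval side of that idea: codes are packed into integers and compared by Hamming distance. Thresholding each entry at zero is a deliberate simplification for illustration; it is an assumption and not how DVSH, DCMH, SSAH, or CYC-DGH learn their hash codes. All vectors and the helper names `binarize`, `pack`, and `hamming` are hypothetical.

```python
def binarize(vector):
    """Map a real-valued vector to bits: 1 where the entry is positive, else 0."""
    return [1 if x > 0 else 0 for x in vector]

def pack(bits):
    """Pack a list of 0/1 bits into one integer for compact storage."""
    code = 0
    for b in bits:
        code = (code << 1) | b
    return code

def hamming(a, b):
    """Hamming distance between two packed codes (number of differing bits)."""
    return bin(a ^ b).count("1")

# Hypothetical real-valued embeddings (made-up numbers), turned into 4-bit codes.
query = pack(binarize([0.7, -0.2, 0.4, -0.9]))   # bits 1010
database = [
    pack(binarize([0.6, -0.1, 0.3, -0.5])),      # bits 1010
    pack(binarize([-0.6, 0.1, 0.3, -0.5])),      # bits 0110
    pack(binarize([0.6, 0.1, 0.3, -0.5])),       # bits 1110
]

distances = [hamming(query, code) for code in database]
ranking = sorted(range(len(database)), key=lambda i: distances[i])
print(distances, ranking)
```

Each comparison is a single XOR followed by a bit count, and each item is stored in a few bits rather than a vector of floats, which is the efficiency and storage argument made for the binary methods.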

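Dataset	Year	Number of images	Labels per image	Number of texts (labels)	Description	Source	Example image-text pair
NUS-WIDE^[76]	2009	269 648 (after removing duplicate images)	2~5	5 108 (after removing meaningless labels)	Multi-label annotated images	Multimedia retrieval laboratory, National University of Singapore	Fig. 7(a)
MSCOCO^[77]	2014	123 287	5	616 435	Image captions	Microsoft	Fig. 7(b)
Flickr30k^[78]	2014	31 783	5	158 915	Image captions	Yahoo website	Fig. 7(c)
Wikipedia^[79]	2014	2 866	1	2 866	Image/text pairs	Wikipedia website	Fig. 7(d)
IAPRTC-12^[80]	2006	19 627	1~5	4 576	Multi-label annotated images	CLEF	Fig. 7(e)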

Model	Text retrieval			Image retrieval
Model	R@1	R@5	R@10	R@1	R@5	R@10
BRNN^[50]	—	—	—	15.2	37.7	50.0
SCAN^[55]	67.4	90.3	95.8	48.6	77.7	85.2
CAAN^[56]	70.1	91.6	97.2	52.8	79.0	87.9
IMRAM^[57]	74.1	93.0	96.6	53.9	79.4	87.2
CyTIR-Net^[61]	36.9	67.8	79.2	21.6	51.8	65.5
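
The R@K columns above report Recall@K. Under the usual definition in image-text matching work, this is the percentage of queries for which at least one correct item appears among the top K retrieved results. The sketch below computes it under that assumption; the rankings and ground-truth sets are made-up toy data, and `recall_at_k` is a hypothetical helper, not code from any of the compared models.

```python
def recall_at_k(rankings, relevant, k):
    """Percentage of queries with at least one relevant item in the top-k results."""
    hits = 0
    for ranked, rel in zip(rankings, relevant):
        if any(item in rel for item in ranked[:k]):
            hits += 1
    return 100.0 * hits / len(rankings)

# Toy data: one ranked candidate list per query, best match first.
rankings = [
    [3, 1, 2, 0],  # query 0
    [0, 2, 1, 3],  # query 1
    [2, 3, 0, 1],  # query 2
    [1, 0, 3, 2],  # query 3
]
# Ground-truth relevant candidate ids for each query.
relevant = [{3}, {1}, {0}, {2}]

r_at_1 = recall_at_k(rankings, relevant, 1)
r_at_3 = recall_at_k(rankings, relevant, 3)
print(r_at_1, r_at_3)
```

Because the top-K list only grows with K, Recall@K never decreases as K increases, which matches the pattern R@1 ≤ R@5 ≤ R@10 in every row of the table.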

Model	Text query	Image query	Average
ACMR^[66]	0.871 0	0.932 0	0.902 0
DVSH^[72]	0.767 3	0.755 2	0.761 3
SSAH^[74]	0.578 0	0.577 0	0.578 0
CYC-DGH^[75]	0.859 0	0.781 0	0.820 0

Model	Text query	Image query	Average
SCM (semantic correlation matching)^[12]	0.226	0.227	0.252
deep-SM^[48]	0.354	0.398	0.376
MSFN+TextNet^[49]	0.453	0.518	0.486
MASLN^[60] (Non-XTD)	0.499	0.623	0.561
MASLN^[60] (XTD)	0.287	0.331	0.309
DSCMR^[63]	0.478	0.521	0.499
ACMR^[66]	0.489	0.619	0.546
CYC-DGH^[75]	0.826	0.820	0.823

Model	Text query	Image query	Average
DeCAF^[47]	0.409	0.486	0.448
deep-SM^[48]	0.776	0.823	0.800
MSFN+TextNet^[49]	0.453	0.518	0.486
MASLN^[60] (Non-XTD)	0.552	0.550	0.551
MASLN^[60] (XTD)	0.284	0.291	0.287
DSCMR^[63]	0.615	0.611	0.613
ACMR^[66]	0.538	0.544	0.541
DCMH^[73]	0.691	0.644	0.667
SSAH^[74]	0.683	0.639	0.661

Model	Text retrieval					Image retrieval
Model	top@1	top@2	top@10	top@20	top@100	top@1	top@2	top@10	top@20	top@100
MSDS^[8]	0.256 0	0.364 0	0.626 0	0.740 0	0.921 0	0.255 0	0.362 0	0.633 0	0.742 0	0.919 0
DBRLM^[46]	0.207 0	0.296 0	0.557 0	0.673 0	0.895 0	0.201 0	0.292 0	0.567 0	0.682 0	0.904 0

Model	Text retrieval			Image retrieval
Model	R@1	R@5	R@10	R@1	R@5	R@10
BRNN^[50]	—	—	—	27.4	60.2	74.8
SCAN^[55]	72.7	94.8	98.4	58.8	88.4	94.8
CAAN^[56]	75.5	95.4	98.5	61.3	89.7	95.2
IMRAM^[57]	76.7	95.6	98.5	61.7	89.1	95.0
CyTIR-Net^[61]	44.7	78.0	88.1	28.9	63.8	79.4
PVSE^[64]	55.2	86.5	93.7	69.2	91.6	96.6
GXN^[65]	56.6	—	94.5	68.5	—	97.9

Model	Text retrieval			Image retrieval
Model	R@1	R@5	R@10	R@1	R@5	R@10
BRNN^[50]	—	—	—	10.7	29.6	42.2
SCAN^[55]	50.4	82.2	90.0	38.6	69.3	80.4
CAAN^[56]	52.5	83.3	90.9	41.2	70.3	82.9
IMRAM^[57]	53.7	83.2	91.0	39.7	69.1	79.8
PVSE^[64]	32.4	63.0	75.0	45.2	74.3	84.5
GXN^[65]	31.7	—	74.6	42.0	—	84.7
