Journal of Frontiers of Computer Science and Technology, 2022, Vol. 16, Issue (3): 489-511. DOI: 10.3778/j.issn.1673-9418.2107076

• Survey · Exploration •

A Survey of Deep Learning for Cross-Modal Image-Text Retrieval

LIU Ying1,2,3,+, GUO Yingying1, FANG Jie1,2,3, FAN Jiulun1,3, HAO Yu1,3, LIU Jiming4

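  1. Center for Image and Information Processing, Xi’an University of Posts and Telecommunications, Xi’an 710121, China
  2. International Joint Research Center for Wireless Communication and Information Processing Technology of Shaanxi Province, Xi’an 710121, China
  3. Key Laboratory of Electronic Information Application Technology for Crime Scene Investigation, Ministry of Public Security, Xi’an University of Posts and Telecommunications, Xi’an 710121, China
  4. School of Communications and Information Engineering, Xi’an University of Posts and Telecommunications, Xi’an 710121, China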
  • Received: 2021-07-21  Revised: 2021-09-23  Publication date: 2022-03-01  Release date: 2021-09-23
  • Corresponding author: + E-mail: liuying_ciip@163.com
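  • About the authors: LIU Ying (born in 1972), female, from Huxian, Shaanxi, Ph.D., professor. Her research interests include image retrieval and image enhancement.
    GUO Yingying (born in 1995), female, from Longnan, Gansu, M.S. candidate. Her research interest is cross-modal image-text retrieval.
    FANG Jie (born in 1993), male, from Xianyang, Shaanxi, Ph.D., associate professor. His research interests include semantic understanding of visual imagery and its applications.
    FAN Jiulun (born in 1964), male, from Wenxian, Henan, Ph.D., professor. His research interests include pattern recognition and image processing.
    HAO Yu (born in 1986), male, from Xi’an, Shaanxi, Ph.D., lecturer. His research interest is intelligent video processing.
    LIU Jiming (born in 1964), male, from Longyan, Fujian, Ph.D., distinguished professor at Xi’an University of Posts and Telecommunications. His research interests include artificial intelligence technology and its industrialization.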
  • Supported by:
    National Natural Science Foundation of China (62071378)

Survey of Research on Deep Learning Image-Text Cross-Modal Retrieval

LIU Ying1,2,3,+, GUO Yingying1, FANG Jie1,2,3, FAN Jiulun1,3, HAO Yu1,3, LIU Jiming4

  1. Center for Image and Information Processing, Xi’an University of Posts and Telecommunications, Xi’an 710121, China
  2. International Joint Research Center for Wireless Communication and Information Processing Technology of Shaanxi Province, Xi’an 710121, China
  3. Key Laboratory of Electronic Information Application Technology for Crime Scene Investigation, Ministry of Public Security, Xi’an University of Posts and Telecommunications, Xi’an 710121, China
  4. School of Communications and Information Engineering, Xi’an University of Posts and Telecommunications, Xi’an 710121, China
  • Received: 2021-07-21  Revised: 2021-09-23  Online: 2022-03-01  Published: 2021-09-23
  • About the authors: LIU Ying, born in 1972, Ph.D., professor. Her research interests include image retrieval and image enhancement.
    GUO Yingying, born in 1995, M.S. candidate. Her research interest is image-text cross-modal retrieval.
    FANG Jie, born in 1993, Ph.D., associate professor. His research interests include semantic understanding of visual images and its applications.
    FAN Jiulun, born in 1964, Ph.D., professor. His research interests include pattern recognition and image processing.
    HAO Yu, born in 1986, Ph.D., lecturer. His research interest is intelligent video processing.
    LIU Jiming, born in 1964, Ph.D., distinguished professor at Xi’an University of Posts and Telecommunications. His research interests include artificial intelligence technology and its industrialization.
  • Supported by:
    National Natural Science Foundation of China (62071378)

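Abstract:

With the rise of deep neural networks, multimodal learning has attracted wide attention. Cross-modal retrieval is an important branch of multimodal learning; its goal is to mine the relationships between samples of different modalities, that is, to use a sample of one modality to retrieve samples of another modality with similar semantics. In recent years, cross-modal retrieval has become a frontier and hot topic of academic research in China and abroad, and it is an important direction for the future development of information retrieval. First, focusing on the latest progress in deep learning-based cross-modal image-text retrieval, this paper describes in detail the development of methods based on real-valued representation learning and on binary representation learning: real-valued representation methods are used to strengthen cross-modal semantic correlation and thereby improve retrieval accuracy, while binary representation learning methods are used to improve the efficiency of cross-modal image-text retrieval and reduce storage space. Second, the public datasets commonly used in cross-modal retrieval are summarized, and the performance of different algorithms on different datasets is compared. In addition, specific applications of cross-modal image-text retrieval in public security, media, and medicine are summarized and analyzed. Finally, in light of existing techniques, development trends and future research directions in the field are discussed.

Keywords: cross-modal retrieval, deep learning, feature learning, image-text matching, real-valued representation, binary representation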

Abstract:

With the rapid development of deep neural networks, multimodal learning techniques have attracted wide attention. Cross-modal retrieval is an important branch of multimodal learning. Its fundamental purpose is to reveal the relations between samples of different modalities by using a sample of one modality to retrieve samples of another modality with similar semantics. In recent years, cross-modal retrieval has gradually become a frontier and hot spot of academic research, and it is an important direction for the future development of information retrieval. This paper focuses on the latest developments in deep learning-based image-text cross-modal retrieval and systematically reviews the development of real-valued representation-based and binary representation-based learning methods. Real-valued representation-based methods are adopted to improve cross-modal semantic relevance and thus retrieval accuracy, while binary representation-based learning methods are used to improve the efficiency of image-text cross-modal retrieval and reduce storage space. In addition, the common public datasets in the field of image-text cross-modal retrieval are summarized, and the performance of various algorithms on different datasets is compared. This paper also summarizes and analyzes specific applications of cross-modal retrieval techniques in the fields of public security, media, and medicine. Finally, in light of existing technologies, development trends and future research directions are discussed.

Keywords: cross-modal retrieval, deep learning, feature learning, image-text matching, real-valued representation, binary representation

CLC number: