Journal of Frontiers of Computer Science and Technology ›› 2023, Vol. 17 ›› Issue (7): 1487-1505. DOI: 10.3778/j.issn.1673-9418.2303025

• Frontiers·Surveys •

Review of Visual Question Answering Technology

WANG Yu, SUN Haichun   

  1. School of Information Network Security, People’s Public Security University of China, Beijing 100038, China
  2. Key Laboratory of Security Technology and Risk Assessment of the Ministry of Public Security, Beijing 100026, China
  • Online:2023-07-01 Published:2023-07-01

Abstract: Visual question answering (VQA) is a popular image-text cross-modal task that combines natural language processing and computer vision. Its main objective is to enable a computer to recognize and retrieve the content of an image and give an accurate answer. VQA integrates several technologies, including object recognition and detection, intelligent question answering, image attribute classification, and scene analysis. It can support many cutting-edge, high-level interactive AI tasks, such as visual dialogue and visual navigation, and has broad application prospects and high application value. In recent years, advances in computer vision, natural language processing, and image-text cross-modal AI models have provided many new techniques and methods for the VQA task. This paper summarizes the mainstream models and dedicated datasets in the VQA field from 2019 to 2022. First, following the modular framework by which the VQA task is implemented, it reviews and discusses the mainstream technical methods used in the key steps. Next, it subdivides the models in this field by the technical methods they adopt and briefly introduces their improvement focus and limitations. Then, it summarizes the commonly used VQA datasets and evaluation metrics and compares the performance of several typical types of models. Finally, it discusses the key problems that urgently need to be solved in the VQA field and offers an outlook on its future applications and technological development.

Key words: visual question answering (VQA), modal fusion, visual dialogue, intelligent question answering, cross-modal technology
