Journal of Frontiers of Computer Science and Technology

• Science Researches •

Visual Question Answering Model Incorporating Multi-modal Knowledge and Supervised Retrieval

GE Yilin, SUN Haichun, YUAN Deyu

  1. School of Information Network Security, People’s Public Security University of China, Beijing 100038, China

Abstract: The visual question answering (VQA) task aims to answer natural-language questions by understanding image content and has broad application prospects. However, traditional models still face the following challenges: they rely on basic visual features and therefore struggle to fully capture the complex information in an image, leaving gaps in semantic understanding and in the integration of external knowledge; the external knowledge they introduce often carries noise, which degrades both retrieval and answer generation; and the lack of an effective supervision mechanism prevents beneficial knowledge from being fully exploited, lowering overall question-answering performance. To address these issues, a visual question answering model that fuses multi-modal knowledge with supervised retrieval is proposed. The model consists of three components. A multi-modal feature extraction module fuses image semantic features, basic visual features, question semantic features, and knowledge features to build a holistic understanding of the question-image pair. A knowledge retrieval module based on multi-modal semantic reasoning applies a multi-layer attention mechanism to precisely retrieve knowledge relevant to the question-image pair. A BLIP reading inference module uses the pretrained BLIP model to infer answers, improving the accuracy of answer generation. In addition, supervised training is used to optimize the retrieval process and reduce noise interference. The model performs strongly on the OK-VQA, FVQA, and VQA 2.0 benchmark datasets, and ablation studies confirm the effectiveness of each component. This approach offers a new solution for knowledge-based visual question answering and demonstrates the potential of multi-modal knowledge fusion and supervised retrieval for improving model performance.
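To make the retrieval component concrete, the following is a minimal sketch of how a multi-layer attention retriever with a supervised retrieval loss could be structured. All module names, feature dimensions, and the scoring scheme are illustrative assumptions based on the abstract's description, not the authors' released code.

```python
# Hypothetical sketch of the knowledge retrieval module and its supervised
# training objective; dimensions and layer counts are illustrative only.
import torch
import torch.nn as nn
import torch.nn.functional as F


class MultiModalKnowledgeRetriever(nn.Module):
    """Fuses question, image, and knowledge features, then scores candidate
    knowledge entries with stacked cross-attention layers."""

    def __init__(self, dim=768, n_layers=3, n_heads=8):
        super().__init__()
        # One projection per feature stream (all assumed pre-extracted).
        self.proj_q = nn.Linear(dim, dim)    # question semantic features
        self.proj_vis = nn.Linear(dim, dim)  # basic visual features
        self.proj_sem = nn.Linear(dim, dim)  # image semantic features
        # Multi-layer attention: queries are knowledge candidates,
        # keys/values form the fused question-image context.
        self.attn = nn.ModuleList(
            nn.MultiheadAttention(dim, n_heads, batch_first=True)
            for _ in range(n_layers)
        )
        self.score = nn.Linear(dim, 1)

    def forward(self, q_feat, vis_feat, sem_feat, know_feat):
        # q_feat: (B, dim); vis_feat, sem_feat: (B, R, dim) region features;
        # know_feat: (B, K, dim) candidate knowledge embeddings.
        ctx = self.proj_q(q_feat).unsqueeze(1)             # (B, 1, dim)
        img = self.proj_vis(vis_feat) + self.proj_sem(sem_feat)
        ctx = torch.cat([ctx, img], dim=1)                 # (B, 1+R, dim)
        h = know_feat
        for layer in self.attn:
            # Each knowledge candidate attends to the question-image context.
            h, _ = layer(h, ctx, ctx)
        return self.score(h).squeeze(-1)                   # (B, K) relevance logits


def supervised_retrieval_loss(logits, gold_mask):
    """Cross-entropy over candidates that pushes probability mass onto
    knowledge entries marked relevant (gold_mask: (B, K) in {0, 1}),
    suppressing noisy candidates during training."""
    log_p = F.log_softmax(logits, dim=-1)
    gold = gold_mask / gold_mask.sum(dim=-1, keepdim=True).clamp(min=1)
    return -(gold * log_p).sum(dim=-1).mean()
```

At inference time, the top-scored entries would be passed to the reading module; the supervision signal (which candidates count as gold) would come from whether an entry supports the annotated answer.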
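For the reading inference step, the sketch below uses the public Hugging Face BLIP VQA checkpoint as a stand-in for the paper's pretrained BLIP reader; prepending the retrieved knowledge to the question text is an illustrative assumption about how the retrieved evidence is consumed.

```python
# Minimal reading-inference sketch with an off-the-shelf BLIP VQA checkpoint.
from PIL import Image
from transformers import BlipProcessor, BlipForQuestionAnswering

processor = BlipProcessor.from_pretrained("Salesforce/blip-vqa-base")
model = BlipForQuestionAnswering.from_pretrained("Salesforce/blip-vqa-base")

image = Image.open("example.jpg").convert("RGB")   # hypothetical input image
question = "What sport is being played?"
knowledge = "A racket and a shuttlecock are used in badminton."  # top retrieved entry

# Condition the reader on the retrieved knowledge by prepending it to the question.
inputs = processor(image, f"{knowledge} {question}", return_tensors="pt")
answer_ids = model.generate(**inputs)
print(processor.decode(answer_ids[0], skip_special_tokens=True))
```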

Key words: visual question answering, knowledge retrieval, cross-modal, external knowledge
