Journal of Frontiers of Computer Science and Technology ›› 2025, Vol. 19 ›› Issue (8): 2203-2218. DOI: 10.3778/j.issn.1673-9418.2407055

• Artificial Intelligence · Pattern Recognition •


Visual Question Answering Model Incorporating Multi-modal Knowledge and Supervised Retrieval

GE Yilin, SUN Haichun, YUAN Deyu   

  1. College of Information and Cyber Security, People's Public Security University of China, Beijing 100038, China
  • Online: 2025-08-01  Published: 2025-07-31


Abstract: The visual question answering (VQA) task aims to answer questions by understanding image content and has broad application prospects. However, traditional models still face the following challenges: they rely on basic visual features, which hampers their ability to fully capture the complex information within images and leads to shortcomings in image semantic understanding and external knowledge integration; the introduced external knowledge often carries noise, which degrades the accuracy of retrieval and answer generation; and the lack of an effective supervision mechanism prevents beneficial knowledge from being fully exploited, reducing overall question-answering performance. To address these issues, a visual question answering model incorporating multi-modal knowledge and supervised retrieval is proposed. The model consists of a multi-modal feature extraction module, a knowledge retrieval module based on multi-modal semantic reasoning, and a BLIP (bootstrapping language-image pre-training) reading inference module. The multi-modal feature extraction module fuses image semantic features, basic visual features, question semantic features, and knowledge features to achieve a holistic understanding of the “question-image” pair. The knowledge retrieval module employs a multi-layer attention mechanism for precise retrieval of knowledge relevant to the “question-image” pair. The BLIP reading inference module uses the pre-trained BLIP model for answer inference, improving the accuracy of answer generation. In addition, supervised training is incorporated to refine the retrieval process and minimize noise interference. Experiments demonstrate superior performance on multiple benchmark datasets, including OK-VQA, FVQA, and VQA 2.0, and ablation studies further validate the efficacy of each component. This approach offers a new solution for knowledge-based visual question answering and illustrates the potential of multi-modal knowledge fusion and supervised retrieval for improving model performance.
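The abstract describes fusing image semantic, basic visual, and question features and then scoring external knowledge entries with an attention mechanism. Since the page gives no code, the following is only a minimal NumPy sketch of that retrieval step: the concatenation-based fusion, the single-head dot-product scoring, and all dimensions are illustrative assumptions, not the authors' implementation (the paper uses a multi-layer attention mechanism over learned features).

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D score vector.
    e = np.exp(x - x.max())
    return e / e.sum()

def fuse_features(img_sem, img_vis, q_sem):
    # Hypothetical fusion: simple concatenation of the three feature vectors.
    # The paper learns this fusion; concatenation is a stand-in.
    return np.concatenate([img_sem, img_vis, q_sem])

def retrieve_knowledge(query, knowledge, top_k=2):
    # Dot-product attention scores between the fused query and each
    # knowledge vector, normalized to weights; return the top-k entries.
    scores = knowledge @ query          # (num_facts,)
    weights = softmax(scores)
    top = np.argsort(-weights)[:top_k]
    return top, weights

rng = np.random.default_rng(0)
d = 4
query = fuse_features(rng.normal(size=d), rng.normal(size=d), rng.normal(size=d))
knowledge = rng.normal(size=(5, 3 * d))  # 5 candidate knowledge vectors
idx, w = retrieve_knowledge(query, knowledge)
print(idx, w.round(3))
```

In the actual model, the retrieved entries would then be passed, together with the question and image, to the BLIP reading inference module for answer generation.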

Key words: visual question answering, knowledge retrieval, cross-modal, external knowledge
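The abstract also states that supervised training refines the retrieval process to reduce noise. One common way to obtain such supervision (an assumption here, not necessarily the paper's scheme) is to pseudo-label a knowledge entry as relevant when it contains the gold answer, then train the retriever's scores with binary cross-entropy. A small illustrative sketch:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def retrieval_bce_loss(scores, labels):
    # Binary cross-entropy between relevance probabilities and pseudo-labels.
    p = sigmoid(scores)
    eps = 1e-12
    return -np.mean(labels * np.log(p + eps) + (1 - labels) * np.log(1 - p + eps))

# Hypothetical pseudo-labels: an entry is "relevant" if it mentions the gold answer.
facts = ["bananas are yellow", "apples grow on trees", "the sky is blue"]
gold_answer = "yellow"
labels = np.array([1.0 if gold_answer in f else 0.0 for f in facts])
scores = np.array([2.0, -1.0, -0.5])  # retriever logits (illustrative values)
loss = retrieval_bce_loss(scores, labels)
print(round(loss, 4))
```

Minimizing this loss pushes the retriever to score answer-bearing knowledge above noisy entries, which is the effect the supervised retrieval component is described as achieving.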