Journal of Frontiers of Computer Science and Technology

• Science Researches •

Visual Question Answering Model Incorporating Multi-modal Knowledge and Supervised Retrieval

GE Yilin, SUN Haichun, YUAN Deyu

  1. School of Information Network Security, People’s Public Security University of China, Beijing 100038, China

Abstract: The visual question answering (VQA) task aims to answer natural-language questions by understanding image content and has broad application prospects. However, traditional models still face the following challenges: they rely on basic visual features and therefore struggle to fully capture the complex information in an image, leaving gaps in semantic understanding and in the integration of external knowledge; the external knowledge they introduce often carries noise, which degrades both retrieval and answer generation; and the lack of an effective supervision mechanism prevents beneficial knowledge from being fully exploited, lowering overall question-answering performance. To address these issues, a visual question answering model that fuses multi-modal knowledge with supervised retrieval is proposed. The model consists of three components. A multi-modal feature extraction module fuses image semantic features, basic visual features, question semantic features, and knowledge features to build a holistic understanding of the question-image pair. A knowledge retrieval module based on multi-modal semantic reasoning applies a multi-layer attention mechanism to precisely retrieve knowledge relevant to the question-image pair. A BLIP reading inference module uses the pretrained BLIP model to infer answers, improving the accuracy of answer generation. In addition, supervised training is used to optimize the retrieval process and reduce noise interference. The model performs strongly on the OK-VQA, FVQA, and VQA 2.0 benchmark datasets, and ablation studies confirm the effectiveness of each component. This approach offers a new solution for knowledge-based visual question answering and demonstrates the potential of multi-modal knowledge fusion and supervised retrieval for improving model performance.
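To make the retrieval component concrete, the following is a minimal sketch of how a multi-layer attention retriever with a supervised retrieval loss could be structured. All module names, feature dimensions, and the scoring scheme are illustrative assumptions based on the abstract's description, not the authors' released code.

```python
# Hypothetical sketch of the knowledge retrieval module and its supervised
# training objective; dimensions and layer counts are illustrative only.
import torch
import torch.nn as nn
import torch.nn.functional as F


class MultiModalKnowledgeRetriever(nn.Module):
    """Fuses question, image, and knowledge features, then scores candidate
    knowledge entries with stacked cross-attention layers."""

    def __init__(self, dim=768, n_layers=3, n_heads=8):
        super().__init__()
        # One projection per feature stream (all assumed pre-extracted).
        self.proj_q = nn.Linear(dim, dim)    # question semantic features
        self.proj_vis = nn.Linear(dim, dim)  # basic visual features
        self.proj_sem = nn.Linear(dim, dim)  # image semantic features
        # Multi-layer attention: queries are knowledge candidates,
        # keys/values form the fused question-image context.
        self.attn = nn.ModuleList(
            nn.MultiheadAttention(dim, n_heads, batch_first=True)
            for _ in range(n_layers)
        )
        self.score = nn.Linear(dim, 1)

    def forward(self, q_feat, vis_feat, sem_feat, know_feat):
        # q_feat: (B, dim); vis_feat, sem_feat: (B, R, dim) region features;
        # know_feat: (B, K, dim) candidate knowledge embeddings.
        ctx = self.proj_q(q_feat).unsqueeze(1)             # (B, 1, dim)
        img = self.proj_vis(vis_feat) + self.proj_sem(sem_feat)
        ctx = torch.cat([ctx, img], dim=1)                 # (B, 1+R, dim)
        h = know_feat
        for layer in self.attn:
            # Each knowledge candidate attends to the question-image context.
            h, _ = layer(h, ctx, ctx)
        return self.score(h).squeeze(-1)                   # (B, K) relevance logits


def supervised_retrieval_loss(logits, gold_mask):
    """Cross-entropy over candidates that pushes probability mass onto
    knowledge entries marked relevant (gold_mask: (B, K) in {0, 1}),
    suppressing noisy candidates during training."""
    log_p = F.log_softmax(logits, dim=-1)
    gold = gold_mask / gold_mask.sum(dim=-1, keepdim=True).clamp(min=1)
    return -(gold * log_p).sum(dim=-1).mean()
```

At inference time, the top-scored entries would be passed to the reading module; the supervision signal (which candidates count as gold) would come from whether an entry supports the annotated answer.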
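For the reading inference step, the sketch below uses the public Hugging Face BLIP VQA checkpoint as a stand-in for the paper's pretrained BLIP reader; prepending the retrieved knowledge to the question text is an illustrative assumption about how the retrieved evidence is consumed.

```python
# Minimal reading-inference sketch with an off-the-shelf BLIP VQA checkpoint.
from PIL import Image
from transformers import BlipProcessor, BlipForQuestionAnswering

processor = BlipProcessor.from_pretrained("Salesforce/blip-vqa-base")
model = BlipForQuestionAnswering.from_pretrained("Salesforce/blip-vqa-base")

image = Image.open("example.jpg").convert("RGB")   # hypothetical input image
question = "What sport is being played?"
knowledge = "A racket and a shuttlecock are used in badminton."  # top retrieved entry

# Condition the reader on the retrieved knowledge by prepending it to the question.
inputs = processor(image, f"{knowledge} {question}", return_tensors="pt")
answer_ids = model.generate(**inputs)
print(processor.decode(answer_ids[0], skip_special_tokens=True))
```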

Key words: visual question answering, knowledge retrieval, cross-modal, external knowledge
