
Journal of Frontiers of Computer Science and Technology ›› 2025, Vol. 19 ›› Issue (8): 2203-2218. DOI: 10.3778/j.issn.1673-9418.2407055
GE Yilin, SUN Haichun, YUAN Deyu
Online: 2025-08-01
Published: 2025-07-31
Abstract: Visual question answering (VQA) aims to answer questions by understanding image content and has broad application prospects. However, conventional models still suffer from the following problems: they rely on low-level visual features and struggle to fully capture the complex information in images, falling short in image semantic understanding and in integrating external knowledge; the external knowledge they introduce is often accompanied by noise, which degrades the accuracy of retrieval and answer generation; and, lacking an effective supervision mechanism, useful knowledge is not fully exploited, lowering overall question-answering performance. To address these problems, this paper proposes a visual question answering model that incorporates multi-modal knowledge and supervised retrieval. The model consists of a multi-modal feature extraction module, a knowledge retrieval module based on multi-modal semantic reasoning, and a BLIP-based reading and reasoning module. The multi-modal feature extraction module fuses image semantic features, low-level image visual features, question semantic features, and knowledge features to build a comprehensive understanding of the question-image pair. The knowledge retrieval module based on multi-modal semantic reasoning adopts a multi-layer attention mechanism to precisely retrieve knowledge relevant to the question-image pair. The BLIP reading and reasoning module uses the pre-trained BLIP model to infer answers, improving the accuracy of answer generation. In addition, supervised training is used to optimize the retrieval process and reduce noise interference. Experiments on several benchmark datasets, including OK-VQA, FVQA, and VQA 2.0, show strong performance, and ablation studies further verify the effectiveness of each component of the model. This work provides a new solution for knowledge-based visual question answering and demonstrates the potential of multi-modal knowledge fusion and supervised retrieval for improving the performance of VQA models.
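The pipeline described above can be summarized as: fuse four feature streams into a joint question-image representation, score candidate knowledge entries with stacked attention, supervise the retriever, and pass the retrieved knowledge to a BLIP reader. The following PyTorch sketch only illustrates that flow under assumed shapes and module choices; `MultiModalFusion`, `SupervisedRetriever`, and the retrieval loss are hypothetical stand-ins rather than the authors' implementation, and the BLIP reader itself is omitted.

```python
# A minimal, runnable sketch of the fusion + supervised retrieval stages.
# All module names, feature dimensions, and the loss are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class MultiModalFusion(nn.Module):
    """Fuse image-semantic, low-level visual, question, and knowledge features
    into a single query vector for the question-image pair."""

    def __init__(self, dim: int = 768):
        super().__init__()
        self.proj = nn.ModuleDict({
            name: nn.Linear(dim, dim)
            for name in ("img_sem", "img_vis", "question", "knowledge")
        })
        self.attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)

    def forward(self, feats: dict) -> torch.Tensor:
        # feats[name]: (batch, tokens, dim); project each stream, then let
        # self-attention mix tokens across all modalities.
        tokens = torch.cat([self.proj[k](v) for k, v in feats.items()], dim=1)
        fused, _ = self.attn(tokens, tokens, tokens)
        return fused.mean(dim=1)  # (batch, dim) joint representation


class SupervisedRetriever(nn.Module):
    """Score candidate knowledge entries with stacked (multi-layer) attention."""

    def __init__(self, dim: int = 768, num_layers: int = 2):
        super().__init__()
        self.layers = nn.ModuleList([
            nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
            for _ in range(num_layers)
        ])

    def forward(self, query: torch.Tensor, knowledge: torch.Tensor) -> torch.Tensor:
        # query: (batch, dim); knowledge: (batch, num_candidates, dim)
        q = query.unsqueeze(1)
        for attn in self.layers:
            q, _ = attn(q, knowledge, knowledge)  # refine the query against the candidates
        return torch.einsum("bkd,bqd->bk", knowledge, q)  # one relevance score per candidate


def retrieval_supervision_loss(scores: torch.Tensor, useful_mask: torch.Tensor) -> torch.Tensor:
    """Push probability mass toward candidates marked useful (assumed supervision signal)."""
    log_p = F.log_softmax(scores, dim=-1)
    return -(log_p * useful_mask).sum(dim=-1).mean()


if __name__ == "__main__":
    batch, candidates, dim = 2, 5, 768
    feats = {k: torch.randn(batch, 4, dim)
             for k in ("img_sem", "img_vis", "question", "knowledge")}
    query = MultiModalFusion(dim)(feats)
    scores = SupervisedRetriever(dim)(query, torch.randn(batch, candidates, dim))
    loss = retrieval_supervision_loss(scores, F.one_hot(torch.tensor([1, 3]), candidates).float())
    print(scores.shape, loss.item())  # torch.Size([2, 5]) and a scalar loss
```

In this sketch the retriever is trained directly against a binary usefulness mask; any other supervision signal (for example, reader feedback on which retrieved entries helped produce the correct answer) could be substituted without changing the surrounding structure.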
GE Yilin, SUN Haichun, YUAN Deyu. Visual Question Answering Model Incorporating Multi-modal Knowledge and Supervised Retrieval[J]. Journal of Frontiers of Computer Science and Technology, 2025, 19(8): 2203-2218.