
Journal of Frontiers of Computer Science and Technology ›› 2025, Vol. 19 ›› Issue (8): 2203-2218. DOI: 10.3778/j.issn.1673-9418.2407055
• Artificial Intelligence · Pattern Recognition •
GE Yilin, SUN Haichun, YUAN Deyu
Online: 2025-08-01
Published: 2025-07-31
GE Yilin, SUN Haichun, YUAN Deyu. Visual Question Answering Model Incorporating Multi-modal Knowledge and Supervised Retrieval[J]. Journal of Frontiers of Computer Science and Technology, 2025, 19(8): 2203-2218.
URL: http://fcst.ceaj.org/EN/10.3778/j.issn.1673-9418.2407055
[1] XU Wei, ZHANG Xiaolin, ZHANG Huanxiang, ZHANG Jing. Combining Dual-Granularity Image Information for Multimodal Aspect-Based Sentiment Analysis[J]. Journal of Frontiers of Computer Science and Technology, 2025, 19(9): 2479-2492.
[2] LI Yi, LI Hao, XU Xiaozhe, YANG Yifan. CFB: Financial Large Models Evaluation Methods[J]. Journal of Frontiers of Computer Science and Technology, 2024, 18(12): 3272-3287.
[3] XUE Di, LI Xin, LIU Mingshuai. PTCR: Knowledge-Based Visual Question Answering Framework Based on Large Language Model[J]. Journal of Frontiers of Computer Science and Technology, 2024, 18(11): 2912-2924.
[4] GUO Leming, XUE Wanli, YUAN Tiantian. Multi-scale Visual Feature Extraction and Cross-Modality Alignment for Continuous Sign Language Recognition[J]. Journal of Frontiers of Computer Science and Technology, 2024, 18(10): 2762-2769.
[5] XU Biqi, MA Zhiqiang, ZHOU Yutong, JIA Wenchao, LIU Jia, LYU Kai. Survey of Research on Knowledge-Driven Dialogue Generation Models[J]. Journal of Frontiers of Computer Science and Technology, 2024, 18(1): 58-74.
[6] WANG Yu, SUN Haichun. Review of Visual Question Answering Technology[J]. Journal of Frontiers of Computer Science and Technology, 2023, 17(7): 1487-1505.
[7] LUO Xuemei, ZHENG Haihong, AN Yaqiang, WANG Di. Online Graph Regularized Non-negative Matrix Factorization Cross-Modal Hashing[J]. Journal of Frontiers of Computer Science and Technology, 2023, 17(3): 678-686.
[8] GU Yuying, GAO Meifeng. Aspect-Level Sentiment Analysis Combining Part-of-Speech and External Knowledge[J]. Journal of Frontiers of Computer Science and Technology, 2023, 17(10): 2488-2498.
[9] SHI Yucheng, WU Yun, LONG Huiyun. Cross-Modal Fusion of RGB-D Salient Detection for Advanced Semantic Repair Strategy[J]. Journal of Frontiers of Computer Science and Technology, 2023, 17(1): 140-153.
[10] LIU Ying, GUO Yingying, FANG Jie, FAN Jiulun, HAO Yu, LIU Jiming. Survey of Research on Deep Learning Image-Text Cross-Modal Retrieval[J]. Journal of Frontiers of Computer Science and Technology, 2022, 16(3): 489-511.
[11] CHEN Ning, DUAN Youxiang, SUN Qifeng. Literature Review of Cross-Modal Retrieval Research[J]. Journal of Frontiers of Computer Science and Technology, 2021, 15(8): 1390-1404.
[12] ZHU Jie, BAI Hongyu, ZHANG Zhongyu, XIE Bojun, ZHANG Junsan. Object Feature Based Deep Hashing for Cross-Modal Retrieval[J]. Journal of Frontiers of Computer Science and Technology, 2021, 15(5): 922-930.
[13] TIAN Xin, JI Yi, GAO Haiyan, LIN Xin, LIU Chunping. Scene Graph Generation Method Based on External Information Guidance and Residual Scrambling[J]. Journal of Frontiers of Computer Science and Technology, 2021, 15(10): 1958-1968.
[14] LIN Yang, CHU Xu, WANG Yasha, MAO Weijia, ZHAO Junfeng. Cross-Modal Recipe Retrieval with Self-Attention Mechanism[J]. Journal of Frontiers of Computer Science and Technology, 2020, 14(9): 1471-1481.
[15] JI Zhong, LI Huihui, HE Yuqing. Zero-Shot Multi-Label Image Classification Based on Deep Instance Differentiation[J]. Journal of Frontiers of Computer Science and Technology, 2019, 13(1): 97-105.