[1] ANTOL S, AGRAWAL A, LU J, et al. VQA: visual question answering[C]//Proceedings of the 2015 IEEE International Conference on Computer Vision. Washington: IEEE Computer Society, 2015: 2425-2433.
[2] GOYAL Y, KHOT T, SUMMERS-STAY D, et al. Making the V in VQA matter: elevating the role of image understanding in visual question answering[C]//Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition. Washington: IEEE Computer Society, 2017: 6904-6913.
[3] CHEN X, WANG X, CHANGPINYO S, et al. PaLI: a jointly-scaled multilingual language-image model[EB/OL]. [2024-04-14]. https://arxiv.org/abs/2209.06794.
[4] WANG P, WANG S, LIN J, et al. ONE-PEACE: exploring one general representation model toward unlimited modalities[EB/OL]. [2024-04-14]. https://arxiv.org/abs/2305.11172.
[5] BAO H, WANG W, DONG L, et al. VLMo: unified vision-language pre-training with mixture-of-modality-experts[C]//Advances in Neural Information Processing Systems 35, New Orleans, Nov 28-Dec 9, 2022: 32897-32912.
[6] YANG Z, GAN Z, WANG J, et al. An empirical study of GPT-3 for few-shot knowledge-based VQA[C]//Proceedings of the 2022 AAAI Conference on Artificial Intelligence. Menlo Park: AAAI, 2022: 3081-3089.
[7] HU Y, HUA H, YANG Z, et al. PromptCap: prompt-guided image captioning for VQA with GPT-3[C]//Proceedings of the 2023 IEEE/CVF International Conference on Computer Vision. Piscataway: IEEE, 2023: 2963-2975.
[8] SHAO Z, YU Z, WANG M, et al. Prompting large language models with answer heuristics for knowledge-based visual question answering[C]//Proceedings of the 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2023: 14974-14983.
[9] HU Z, YANG P, JIANG Y, et al. Prompting large language model with context and pre-answer for knowledge-based VQA[J]. Pattern Recognition, 2024, 151: 110399.
[10] ZHU D, CHEN J, SHEN X, et al. MiniGPT-4: enhancing vision-language understanding with advanced large language models[EB/OL]. [2024-04-14]. https://arxiv.org/abs/2304.10592.
[11] LI J, LI D, SAVARESE S, et al. BLIP-2: bootstrapping language-image pre-training with frozen image encoders and large language models[C]//Proceedings of the 2023 International Conference on Machine Learning, Honolulu, Jul 23-29, 2023: 19730-19742.
[12] ZHOU C, LI Q, LI C, et al. A comprehensive survey on pretrained foundation models: a history from BERT to ChatGPT [EB/OL]. [2024-04-14]. https://arxiv.org/abs/2302.09419.
[13] LAURIOLA I, LAVELLI A, AIOLLI F. An introduction to deep learning in natural language processing: models, techniques, and tools[J]. Neurocomputing, 2022, 470: 443-456.
[14] LIU Y, ZHANG Y, WANG Y, et al. A survey of visual transformers[J]. IEEE Transactions on Neural Networks and Learning Systems, 2024, 35(6): 7478-7498.
[15] GUO M H, XU T X, LIU J J, et al. Attention mechanisms in computer vision: a survey[J]. Computational Visual Media, 2022, 8(3): 331-368.
[16] MARINO K, RASTEGARI M, FARHADI A, et al. OK-VQA: a visual question answering benchmark requiring external knowledge[C]//Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2019: 3195-3204.
[17] SINGH A, NATARAJAN V, SHAH M, et al. Towards VQA models that can read[C]//Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2019: 8317-8326.
[18] MISHRA A, SHEKHAR S, SINGH A K, et al. OCR-VQA: visual question answering by reading text in images[C]//Proceedings of the 2019 International Conference on Document Analysis and Recognition. Piscataway: IEEE, 2019: 947-952.
[19] LU P, MISHRA S, XIA T, et al. Learn to explain: multimodal reasoning via thought chains for science question answering[C]//Advances in Neural Information Processing Systems 35, New Orleans, Nov 28-Dec 9, 2022: 2507-2521.
[20] ZHANG D, YU Y, LI C, et al. MM-LLMs: recent advances in multimodal large language models[EB/OL]. [2024-04-14]. https://arxiv.org/abs/2401.13601.
[21] ALAYRAC J B, DONAHUE J, LUC P, et al. Flamingo: a visual language model for few-shot learning[C]//Advances in Neural Information Processing Systems 35, New Orleans, Nov 28-Dec 9, 2022: 23716-23736.
[22] ZHENG K, HE X, WANG X E. MiniGPT-5: interleaved vision-and-language generation via generative vokens[EB/OL]. [2024-04-14]. https://arxiv.org/abs/2310.02239.
[23] TEAM G, ANIL R, BORGEAUD S, et al. Gemini: a family of highly capable multimodal models[EB/OL]. [2024-04-14]. https://arxiv.org/abs/2312.11805.
[24] WANG W, LV Q, YU W, et al. CogVLM: visual expert for pretrained language models[EB/OL]. [2024-04-14]. https://arxiv.org/abs/2311.03079.
[25] WANG W, BAO H, DONG L, et al. Image as a foreign language: BEiT pretraining for vision and vision-language tasks[C]//Proceedings of the 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2023: 19175-19186.
[26] SCHWENK D, KHANDELWAL A, CLARK C, et al. A-OKVQA: a benchmark for visual question answering using world knowledge[C]//Proceedings of the 17th European Conference on Computer Vision. Cham: Springer, 2022: 146-162.
[27] ZHU Z, YU J, WANG Y, et al. Mucko: multi-layer cross-modal knowledge reasoning for fact-based visual question answering[EB/OL]. [2024-04-14]. https://arxiv.org/abs/2006.09073.
[28] MARINO K, CHEN X, PARIKH D, et al. KRISP: integrating implicit and symbolic knowledge for open-domain knowledge-based VQA[C]//Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2021: 14111-14121.
[29] VRANDEČIĆ D, KRÖTZSCH M. Wikidata: a free collaborative knowledgebase[J]. Communications of the ACM, 2014, 57(10): 78-85.
[30] SPEER R, CHIN J, HAVASI C, et al. ConceptNet 5.5: an open multilingual graph of general knowledge[C]//Proceedings of the 2017 AAAI Conference on Artificial Intelligence. Menlo Park: AAAI, 2017: 4444-4451.
[31] BROWN T, MANN B, RYDER N, et al. Language models are few-shot learners[C]//Advances in Neural Information Processing Systems 33, Dec 6-12, 2020: 1877-1901.
[32] SAHOO P, SINGH A K, SAHA S, et al. A systematic survey of prompt engineering in large language models: techniques and applications[EB/OL]. [2024-04-14]. https://arxiv.org/abs/2402.07927.
[33] DONG Q, LI L, DAI D, et al. A survey on in-context learning[EB/OL]. [2024-04-14]. https://arxiv.org/abs/2301.00234.
[34] WEI J, WANG X, SCHUURMANS D, et al. Chain-of-thought prompting elicits reasoning in large language models[C]//Advances in Neural Information Processing Systems 35, New Orleans, Nov 28-Dec 9, 2022: 24824-24837.
[35] NORI H, LEE Y T, ZHANG S, et al. Can generalist foundation models outcompete special-purpose tuning? Case study in medicine[EB/OL]. [2024-04-14]. https://arxiv.org/abs/2311.16452.
[36] KOJIMA T, GU S S, REID M, et al. Large language models are zero-shot reasoners[C]//Advances in Neural Information Processing Systems 35, New Orleans, Nov 28-Dec 9, 2022: 22199-22213.
[37] HE R, SUN S, YU X, et al. Is synthetic data from generative models ready for image recognition?[C]//Proceedings of the 11th International Conference on Learning Representations, Kigali, May 1-5, 2023.
[38] CHEN L, LI J, DONG X, et al. ShareGPT4V: improving large multi-modal models with better captions[EB/OL]. [2024-04-14]. https://arxiv.org/abs/2311.12793.
[39] ZHENG L, CHIANG W L, SHENG Y, et al. Judging LLM-as-a-judge with MT-bench and chatbot arena[C]//Advances in Neural Information Processing Systems 36, New Orleans, Dec 10-16, 2023.
[40] GAO F, PING Q, THATTAI G, et al. Transform-retrieve-generate: natural language-centric outside-knowledge visual question answering[C]//Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2022: 5067-5077.
[41] WU J, LU J, SABHARWAL A, et al. Multi-modal answer validation for knowledge-based VQA[C]//Proceedings of the 2022 AAAI Conference on Artificial Intelligence. Menlo Park: AAAI, 2022: 2712-2721.
[42] GUO Y, NIE L, WONG Y, et al. A unified end-to-end retriever-reader framework for knowledge-based VQA[C]//Proceedings of the 30th ACM International Conference on Multimedia. New York: ACM, 2022: 2061-2069.
[43] KAMATH A, CLARK C, GUPTA T, et al. Webly supervised concept expansion for general purpose vision models[C]//Proceedings of the 17th European Conference on Computer Vision. Cham: Springer, 2022: 662-681.