
Journal of Frontiers of Computer Science and Technology ›› 2024, Vol. 18 ›› Issue (11): 2912-2924. DOI: 10.3778/j.issn.1673-9418.2406028
• Special Issue on Constructions and Applications of Large Language Models in Specific Domains •
XUE Di, LI Xin, LIU Mingshuai
Online: 2024-11-01
Published: 2024-10-31
XUE Di, LI Xin, LIU Mingshuai. PTCR: Knowledge-Based Visual Question Answering Framework Based on Large Language Model[J]. Journal of Frontiers of Computer Science and Technology, 2024, 18(11): 2912-2924.
URL: http://fcst.ceaj.org/EN/10.3778/j.issn.1673-9418.2406028