Journal of Frontiers of Computer Science and Technology ›› 2024, Vol. 18 ›› Issue (11): 2912-2924. DOI: 10.3778/j.issn.1673-9418.2406028
• Special Issue on Constructions and Applications of Large Language Models in Specific Domains •
PTCR: Knowledge-Based Visual Question Answering Framework Based on Large Language Model
XUE Di, LI Xin, LIU Mingshuai
Online: 2024-11-01
Published: 2024-10-31
XUE Di, LI Xin, LIU Mingshuai. PTCR: Knowledge-Based Visual Question Answering Framework Based on Large Language Model[J]. Journal of Frontiers of Computer Science and Technology, 2024, 18(11): 2912-2924.
URL: http://fcst.ceaj.org/EN/10.3778/j.issn.1673-9418.2406028
Related Articles
[1] XIANG Xiaowei, SHEN Yanguang, HU Minghao, YAN Tianwei, LUO Wei, LUO Zhunchen. Research on Science and Technology Policy and Regulation Q&A System Driven by Large Models[J]. Journal of Frontiers of Computer Science and Technology, 2024, 18(9): 2349-2360.
[2] LI Yifei, ZHANG Lingling, DONG Yuxuan, WANG Jiaxin, ZHONG Yujie, WEI Bifan. Large Language Model Augmentation and Feature Alignment Method for Few-Shot Continual Relation Extraction[J]. Journal of Frontiers of Computer Science and Technology, 2024, 18(9): 2326-2336.
[3] JI Guiyang, WANG Peiyan, YU Zhuo. Research on Knowledge Injection Method for Large Language Model Oriented to Process Specification Texts[J]. Journal of Frontiers of Computer Science and Technology, 2024, 18(9): 2361-2369.
[4] CHEN Longfei, GAO Xin, HOU Haotian, YE Chuyang, LIU Ya'ou, ZHANG Meihui. Application of Generative Large Language Models in Chinese Radiology Domain[J]. Journal of Frontiers of Computer Science and Technology, 2024, 18(9): 2337-2348.
[5] LUO Shijie, JIN Rize, HAN Shuzhen. Research on University Basic Knowledge Question-Answering Using Low-Rank Encoding to Optimize Large Language Model[J]. Journal of Frontiers of Computer Science and Technology, 2024, 18(8): 2156-2168.
[6] SHENG Lei, CHEN Xiliang, LAI Jun. Offline Multi-agent Reinforcement Learning Method Based on Latent State Distribution GPT[J]. Journal of Frontiers of Computer Science and Technology, 2024, 18(8): 2169-2179.
[7] ZHANG Qi, ZHONG Hao. Submodular Optimization Approach for Entity Summarization in Knowledge Graph Driven by Large Language Models[J]. Journal of Frontiers of Computer Science and Technology, 2024, 18(7): 1806-1813.
[8] FENG Jun, CHANG Yanghong, LU Jiamin, TANG Hailin, LYU Zhipeng, QIU Yuchun. Construction and Application of Knowledge Graph for Water Engineering Scheduling Based on Large Language Model[J]. Journal of Frontiers of Computer Science and Technology, 2024, 18(6): 1637-1647.
[9] FENG Tuoyu, LI Weiping, GUO Qinglang, WANG Gangliang, ZHANG Yusong, QIAO Zijian. Overview of Knowledge Graph Question Answering Enhanced by Large Language Models[J]. Journal of Frontiers of Computer Science and Technology, 2024, 18(11): 2887-2900.
[10] LIU Jun, LENG Fangling, WU Wangwang, BAO Yubin. Construction Method of Textbook Knowledge Graph Based on Multimodal and Knowledge Distillation[J]. Journal of Frontiers of Computer Science and Technology, 2024, 18(11): 2901-2911.
[11] LIU Xin, GAO Huiquan, SHAO Changheng, CHEN Ziliang, LU Wenjuan, YANG Huiru. Construction and Application of Large Language Model for Public Complaints with Knowledge Reasoning and Similarity Retrieval[J]. Journal of Frontiers of Computer Science and Technology, 2024, 18(11): 2940-2953.
[12] SANG Chenyang, MA Tinghuai, XIE Xintong, SUN Shengjie, HUANG Rui. Multi-stage Reasoning Method for Emotional Support Dialogue Generation Based on Large Language Models[J]. Journal of Frontiers of Computer Science and Technology, 2024, 18(11): 2925-2939.
[13] JI Xiangyu, WANG Xin, ZHANG Heyi, MENG Zhaopeng, ZHANG Junhua, ZHUANG Pengwei, JIA Yongzhe, XU Dawei. Knowledge Augmentation on Traditional Chinese Medicine Language Model[J]. Journal of Frontiers of Computer Science and Technology, 2024, 18(10): 2616-2629.
[14] LIANG Jia, ZHANG Liping, YAN Sheng, ZHAO Yubo, ZHANG Yawen. Research Progress of Named Entity Recognition Based on Large Language Model[J]. Journal of Frontiers of Computer Science and Technology, 2024, 18(10): 2594-2615.
[15] GUO Leming, XUE Wanli, YUAN Tiantian. Multi-scale Visual Feature Extraction and Cross-Modality Alignment for Continuous Sign Language Recognition[J]. Journal of Frontiers of Computer Science and Technology, 2024, 18(10): 2762-2769.