[1] ANTOL S, AGRAWAL A, LU J, et al. VQA: visual question answering[C]//Proceedings of the 2015 IEEE International Conference on Computer Vision. Washington: IEEE Computer Society, 2015: 2425-2433.
[2] GOYAL Y, KHOT T, SUMMERS-STAY D, et al. Making the V in VQA matter: elevating the role of image understanding in visual question answering[C]//Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition. Washington: IEEE Computer Society, 2017: 6904-6913.
[3] CHEN X, WANG X, CHANGPINYO S, et al. PaLI: a jointly-scaled multilingual language-image model[EB/OL]. [2024-04-14]. https://arxiv.org/abs/2209.06794.
[4] WANG P, WANG S, LIN J, et al. ONE-PEACE: exploring one general representation model toward unlimited modalities[EB/OL]. [2024-04-14]. https://arxiv.org/abs/2305.11172.
[5] BAO H, WANG W, DONG L, et al. VLMo: unified vision-language pre-training with mixture-of-modality-experts[C]//Advances in Neural Information Processing Systems 35, New Orleans, Nov 28-Dec 9, 2022: 32897-32912.
[6] YANG Z, GAN Z, WANG J, et al. An empirical study of GPT-3 for few-shot knowledge-based VQA[C]//Proceedings of the 2022 AAAI Conference on Artificial Intelligence. Menlo Park: AAAI, 2022: 3081-3089.
[7] HU Y, HUA H, YANG Z, et al. PromptCap: prompt-guided image captioning for VQA with GPT-3[C]//Proceedings of the 2023 IEEE/CVF International Conference on Computer Vision. Piscataway: IEEE, 2023: 2963-2975.
[8] SHAO Z, YU Z, WANG M, et al. Prompting large language models with answer heuristics for knowledge-based visual question answering[C]//Proceedings of the 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2023: 14974-14983.
[9] HU Z, YANG P, JIANG Y, et al. Prompting large language model with context and pre-answer for knowledge-based VQA[J]. Pattern Recognition, 2024, 151: 110399.
[10] ZHU D, CHEN J, SHEN X, et al. MiniGPT-4: enhancing vision-language understanding with advanced large language models[EB/OL]. [2024-04-14]. https://arxiv.org/abs/2304.10592.
[11] LI J, LI D, SAVARESE S, et al. BLIP-2: bootstrapping language-image pre-training with frozen image encoders and large language models[C]//Proceedings of the 2023 International Conference on Machine Learning, Honolulu, Jul 23-29, 2023: 19730-19742.
[12] ZHOU C, LI Q, LI C, et al. A comprehensive survey on pretrained foundation models: a history from BERT to ChatGPT [EB/OL]. [2024-04-14]. https://arxiv.org/abs/2302.09419.
[13] LAURIOLA I, LAVELLI A, AIOLLI F. An introduction to deep learning in natural language processing: models, techniques, and tools[J]. Neurocomputing, 2022, 470: 443-456.
[14] LIU Y, ZHANG Y, WANG Y, et al. A survey of visual transformers[J]. IEEE Transactions on Neural Networks and Learning Systems, 2024, 35(6): 7478-7498.
[15] GUO M H, XU T X, LIU J J, et al. Attention mechanisms in computer vision: a survey[J]. Computational Visual Media, 2022, 8(3): 331-368.
[16] MARINO K, RASTEGARI M, FARHADI A, et al. OK-VQA: a visual question answering benchmark requiring external knowledge[C]//Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2019: 3195-3204.
[17] SINGH A, NATARAJAN V, SHAH M, et al. Towards VQA models that can read[C]//Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2019: 8317-8326.
[18] MISHRA A, SHEKHAR S, SINGH A K, et al. OCR-VQA: visual question answering by reading text in images[C]//Proceedings of the 2019 International Conference on Document Analysis and Recognition. Piscataway: IEEE, 2019: 947-952.
[19] LU P, MISHRA S, XIA T, et al. Learn to explain: multimodal reasoning via thought chains for science question answering[C]//Advances in Neural Information Processing Systems 35, New Orleans, Nov 28-Dec 9, 2022: 2507-2521.
[20] ZHANG D, YU Y, LI C, et al. MM-LLMs: recent advances in multimodal large language models[EB/OL]. [2024-04-14]. https://arxiv.org/abs/2401.13601.
[21] ALAYRAC J B, DONAHUE J, LUC P, et al. Flamingo: a visual language model for few-shot learning[C]//Advances in Neural Information Processing Systems 35, New Orleans, Nov 28-Dec 9, 2022: 23716-23736.
[22] ZHENG K, HE X, WANG X E. MiniGPT-5: interleaved vision-and-language generation via generative vokens[EB/OL]. [2024-04-14]. https://arxiv.org/abs/2310.02239.
[23] TEAM G, ANIL R, BORGEAUD S, et al. Gemini: a family of highly capable multimodal models[EB/OL]. [2024-04-14]. https://arxiv.org/abs/2312.11805.
[24] WANG W, LV Q, YU W, et al. CogVLM: visual expert for pretrained language models[EB/OL]. [2024-04-14]. https://arxiv.org/abs/2311.03079.
[25] WANG W, BAO H, DONG L, et al. Image as a foreign language: BEiT pretraining for vision and vision-language tasks[C]//Proceedings of the 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2023: 19175-19186.
[26] SCHWENK D, KHANDELWAL A, CLARK C, et al. A-OKVQA: a benchmark for visual question answering using world knowledge[C]//Proceedings of the 17th European Conference on Computer Vision. Cham: Springer, 2022: 146-162.
[27] ZHU Z, YU J, WANG Y, et al. Mucko: multi-layer cross-modal knowledge reasoning for fact-based visual question answering[EB/OL]. [2024-04-14]. https://arxiv.org/abs/2006.09073.
[28] MARINO K, CHEN X, PARIKH D, et al. KRISP: integrating implicit and symbolic knowledge for open-domain knowledge-based VQA[C]//Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2021: 14111-14121.
[29] VRANDEČIĆ D, KRÖTZSCH M. Wikidata: a free collaborative knowledgebase[J]. Communications of the ACM, 2014, 57(10): 78-85.
[30] SPEER R, CHIN J, HAVASI C, et al. ConceptNet 5.5: an open multilingual graph of general knowledge[C]//Proceedings of the 2017 AAAI Conference on Artificial Intelligence. Menlo Park: AAAI, 2017: 4444-4451.
[31] BROWN T, MANN B, RYDER N, et al. Language models are few-shot learners[C]//Advances in Neural Information Processing Systems 33, Dec 6-12, 2020: 1877-1901.
[32] SAHOO P, SINGH A K, SAHA S, et al. A systematic survey of prompt engineering in large language models: techniques and applications[EB/OL]. [2024-04-14]. https://arxiv.org/abs/2402.07927.
[33] DONG Q, LI L, DAI D, et al. A survey on in-context learning[EB/OL]. [2024-04-14]. https://arxiv.org/abs/2301.00234.
[34] WEI J, WANG X, SCHUURMANS D, et al. Chain-of-thought prompting elicits reasoning in large language models[C]//Advances in Neural Information Processing Systems 35, New Orleans, Nov 28-Dec 9, 2022: 24824-24837.
[35] NORI H, LEE Y T, ZHANG S, et al. Can generalist foundation models outcompete special-purpose tuning? Case study in medicine[EB/OL]. [2024-04-14]. https://arxiv.org/abs/2311.16452.
[36] KOJIMA T, GU S S, REID M, et al. Large language models are zero-shot reasoners[C]//Advances in Neural Information Processing Systems 35, New Orleans, Nov 28-Dec 9, 2022: 22199-22213.
[37] HE R, SUN S, YU X, et al. Is synthetic data from generative models ready for image recognition?[C]//Proceedings of the 11th International Conference on Learning Representations, Kigali, May 1-5, 2023.
[38] CHEN L, LI J, DONG X, et al. ShareGPT4V: improving large multi-modal models with better captions[EB/OL]. [2024-04-14]. https://arxiv.org/abs/2311.12793.
[39] ZHENG L, CHIANG W L, SHENG Y, et al. Judging LLM-as-a-judge with MT-bench and chatbot arena[C]//Advances in Neural Information Processing Systems 36, New Orleans, Dec 10-16, 2023.
[40] GAO F, PING Q, THATTAI G, et al. Transform-retrieve-generate: natural language-centric outside-knowledge visual question answering[C]//Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2022: 5067-5077.
[41] WU J, LU J, SABHARWAL A, et al. Multi-modal answer validation for knowledge-based VQA[C]//Proceedings of the 2022 AAAI Conference on Artificial Intelligence. Menlo Park: AAAI, 2022: 2712-2721.
[42] GUO Y, NIE L, WONG Y, et al. A unified end-to-end retriever-reader framework for knowledge-based VQA[C]//Proceedings of the 30th ACM International Conference on Multimedia. New York: ACM, 2022: 2061-2069.
[43] KAMATH A, CLARK C, GUPTA T, et al. Webly supervised concept expansion for general purpose vision models[C]//Proceedings of the 17th European Conference on Computer Vision. Cham: Springer, 2022: 662-681.