Journal of Frontiers of Computer Science and Technology


PTCR: A Knowledge-Based Visual Question Answering Framework Based on Large Language Models

XUE Di, LI Xin, LIU Mingshuai

  1. School of Information and Cyber Security, People's Public Security University of China, Beijing 100038, China
  2. Key Laboratory of Security Technology and Risk Assessment, Ministry of Public Security, Beijing 100026, China

Abstract: To address the problems of insufficient model input information and weak reasoning performance in knowledge-based Visual Question Answering (VQA), this paper proposes PTCR, a knowledge-based VQA framework built on a Large Language Model (LLM). The framework consists of four parts: answer candidate generation, targeted image description, autonomous Chain-of-Thought (CoT) construction, and prompted LLM reasoning. PTCR uses the LLM to guide a multimodal large language model to generate targeted image descriptions, resolving the incomplete coverage of conventional image captions; it improves reasoning ability by having the LLM autonomously generate chains of thought, which supply the thinking process of similar questions during inference; and it introduces option reordering to eliminate the LLM's positional bias toward answer choices, using majority voting to reduce random error in the reasoning. Experiments show that the CogVLM model enhanced by the PTCR framework improves accuracy by 16.7% on OK-VQA and 13.3% on A-OKVQA. Compared with Prophet, PTCR improves accuracy by 3.4% on OK-VQA and 5.0% on A-OKVQA. Ablation studies confirm that the proposed techniques, including targeted image descriptions and autonomous chains of thought, each contribute to the accuracy gains. These results indicate that the PTCR framework improves the performance of knowledge-based VQA.
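
To make the reasoning step concrete, the following is a minimal Python sketch, not the paper's implementation, of how the option reordering and majority voting described in the abstract could be combined. The function query_llm, the prompt wording, and the n_rounds default are illustrative assumptions; only the overall procedure follows the abstract.

    # Sketch of the PTCR inference step: rebuild the prompt several times
    # with the answer candidates shuffled (cancelling the LLM's positional
    # bias toward particular option slots), then pick the majority answer.
    import random
    from collections import Counter

    def ptcr_infer(question, image_description, candidates, cot_examples,
                   query_llm, n_rounds=5, seed=0):
        # query_llm: hypothetical stand-in for any text-completion call
        rng = random.Random(seed)
        votes = []
        for _ in range(n_rounds):
            shuffled = candidates[:]
            rng.shuffle(shuffled)                 # option reordering
            prompt = (
                "Image description (targeted): " + image_description + "\n"
                "Similar questions with reasoning (autonomous CoT):\n"
                + "\n".join(cot_examples) + "\n"
                "Question: " + question + "\n"
                "Candidates: " + ", ".join(shuffled) + "\n"
                "Think step by step, then answer with one candidate."
            )
            votes.append(query_llm(prompt).strip().lower())
        # majority voting reduces the randomness of any single LLM run
        return Counter(votes).most_common(1)[0][0]

With n_rounds greater than one, an answer that was favored only because of where it happened to appear in the candidate list is unlikely to survive the vote.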

Key words: Visual Question Answering, Prompt Engineering, Large Language Model, Cross-modal
