Journal of Frontiers of Computer Science and Technology ›› 2024, Vol. 18 ›› Issue (11): 2912-2924. DOI: 10.3778/j.issn.1673-9418.2406028

• Special Topic on Construction and Application of Large Models in Vertical Domains •

PTCR: Knowledge-Based Visual Question Answering Framework Based on Large Language Model

XUE Di, LI Xin, LIU Mingshuai

  1. School of Information and Cyber Security, People's Public Security University of China, Beijing 100038, China
    2. Key Laboratory of Security Technology and Risk Assessment, Ministry of Public Security, Beijing 100026, China
  • Online: 2024-11-01  Published: 2024-10-31

PTCR: Knowledge-Based Visual Question Answering Framework Based on Large Language Model

XUE Di, LI Xin, LIU Mingshuai   

  1. School of Information and Cyber Security, People's Public Security University of China, Beijing 100038, China
    2. Key Laboratory of Security Technology and Risk Assessment, Ministry of Public Security, Beijing 100026, China
  • Online: 2024-11-01  Published: 2024-10-31

Abstract: To address the problems of insufficient input information and poor reasoning performance in knowledge-based visual question answering (VQA) models, this paper constructs PTCR, a knowledge-based VQA framework built on a large language model (LLM). The framework consists of four parts: answer candidate generation, targeted image description, autonomous chain-of-thought construction, and prompted LLM reasoning. The PTCR framework uses the LLM to guide a multimodal large model in generating targeted image descriptions, solving the problem of incomplete coverage in earlier image captions; it has the LLM autonomously generate chains of thought and supplies the reasoning traces of similar questions during inference, improving the model's reasoning ability; and it introduces option reordering during inference to eliminate the LLM's positional bias in option selection, while majority voting reduces the random error of reasoning. Experimental results show that the accuracy of the CogVLM model enhanced by the PTCR framework improves by 16.7 and 13.3 percentage points on the OK-VQA and A-OKVQA datasets, respectively. Compared with Prophet, the PTCR framework improves accuracy by 3.4 and 5.0 percentage points on OK-VQA and A-OKVQA, respectively. Ablation experiments confirm that the targeted image description, autonomous chain of thought, and the other methods used all contribute to the accuracy gains. The PTCR framework thus improves performance on knowledge-based VQA tasks.
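To make the targeted image description step concrete, the following is a minimal sketch of the idea, assuming hypothetical ask_llm and ask_vlm callables for the language model and the multimodal model; the actual prompts and model interfaces used in the paper are not reproduced here.

```python
# Sketch of LLM-guided targeted image captioning (hypothetical interfaces).
# ask_llm(prompt) -> str        : text-only LLM call (assumed, not from the paper)
# ask_vlm(image, prompt) -> str : multimodal model call (assumed, not from the paper)

def targeted_caption(question: str, image, ask_llm, ask_vlm) -> str:
    """Generate an image description focused on what the question needs."""
    # 1. Ask the LLM which visual details are required to answer the question.
    focus = ask_llm(
        "List the visual details one would need to observe in an image "
        f"to answer the question: {question}"
    )
    # 2. Ask the multimodal model to describe the image with that focus,
    #    instead of producing a generic caption with incomplete coverage.
    return ask_vlm(
        image,
        f"Describe this image in detail, paying particular attention to: {focus}"
    )
```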

Key words: visual question answering, prompt engineering, large language model, cross-modal

Abstract: To address the problems of insufficient model input information and poor reasoning performance in knowledge-based visual question answering (VQA), this paper constructs PTCR, a knowledge-based VQA framework built on a large language model (LLM). The framework consists of four parts: answer candidate generation, targeted image description, autonomous chain-of-thought (CoT) construction, and prompted LLM inference. The PTCR framework uses the LLM to guide a multimodal large language model in generating targeted image descriptions, which solves the problem of incomplete coverage in previous image captions. It improves the model's reasoning ability by having the LLM autonomously generate CoT and by providing the reasoning traces of similar questions during inference. It also introduces option reordering during inference to eliminate the LLM's positional bias in option selection, and reduces the random error of reasoning through majority voting. Experimental results show that the accuracy of the CogVLM model enhanced by the PTCR framework is improved by 16.7 and 13.3 percentage points on the OK-VQA and A-OKVQA datasets, respectively. Meanwhile, compared with Prophet, the accuracy of the PTCR framework is improved by 3.4 and 5.0 percentage points on OK-VQA and A-OKVQA, respectively. Ablation experiments demonstrate that the methods used in this paper, such as targeted image description and autonomous CoT, all improve accuracy. The PTCR framework thus improves the performance of knowledge-based VQA.
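The option reordering and majority voting described in the abstract can be sketched as follows; query_llm, the prompt layout, and the number of votes are illustrative assumptions rather than the paper's exact implementation.

```python
import random
from collections import Counter

def vote_answer(question: str, caption: str, candidates: list[str],
                cot_examples: str, query_llm, n_votes: int = 5) -> str:
    """Prompt the LLM several times with shuffled candidate orders and
    return the majority answer. query_llm(prompt) -> str is an assumed
    stand-in for the actual LLM call used in the paper."""
    votes = []
    for _ in range(n_votes):
        # Reorder the candidates so no answer always occupies the same slot,
        # countering the LLM's positional bias in option selection.
        shuffled = random.sample(candidates, k=len(candidates))
        options = "\n".join(f"({chr(ord('A') + i)}) {c}"
                            for i, c in enumerate(shuffled))
        prompt = (
            f"{cot_examples}\n\n"              # reasoning traces of similar questions
            f"Image description: {caption}\n"  # targeted caption of the image
            f"Question: {question}\n"
            f"Candidates:\n{options}\n"
            "Answer with the text of the best candidate."
        )
        votes.append(query_llm(prompt).strip().lower())
    # Majority voting reduces the random error of any single generation.
    return Counter(votes).most_common(1)[0][0]
```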

Key words: visual question answering, prompt engineering, large language model, cross-modal