Journal of Frontiers of Computer Science and Technology

• Science Researches •

A Technical Framework for Visual Question Answering System of Business Knowledge for Case-Related Properties

XUE Di, LI Xin, JIANG Zhangtao, WANG Xiaoyu, LIU Mingshuai

  1. School of Information and Cyber Security, People's Public Security University of China, Beijing 100038, China
  2. Key Laboratory of Security Technology and Risk Assessment, Ministry of Public Security, Beijing 100026, China
  3. Public Security Big Data Strategy Research Center, People's Public Security University of China, Beijing 100038, China


Abstract: The standardized management and lawful disposal of case-related property is a key link in the handling of administrative and criminal cases by public security authorities, and it bears directly on the fairness and procedural legality of law enforcement. In recent years, the relevant departments have issued a series of normative legal documents to guide police officers in this work. However, the procedures for investigation and evidence collection, sealing, seizure, freezing, and receipt differ across types of case-related property, which makes it difficult for officers to understand the provisions accurately and apply them strictly. To help grassroots police officers manage case-related property, this paper proposes a technical framework for visual question answering over business knowledge of case-related property. The framework targets three problems: visual question answering models lack business knowledge of case-related property, conventional retrieval-augmented generation has a low recall rate, and model inference performance is poor. A multimodal large model rewrites and completes the question according to the image, which solves the problem that direct retrieval fails to hit the relevant information. The knowledge base and the queries are vectorized with the Conan-embedding model, which improves knowledge retrieval. A visual question answering dataset and a public security knowledge base on the business knowledge of case-related property are constructed; for laws and regulations, the conventional fixed-size chunking method is abandoned in favor of dynamic segmentation that stores the text article by article. The LongLLMLingua model compresses the retrieved external knowledge according to the rewritten question, which improves accuracy while effectively reducing context length. Experimental results show that the proposed method reaches an accuracy of 71.98%, an improvement of 18.68 percentage points over using GLM-V directly, and outperforms the other baseline models, which verifies its effectiveness.
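The article-by-article storage described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `split_by_article`, the heading pattern, and the sample text are assumptions made for illustration. It assumes that each article of a Chinese legal document begins on a new line with a heading of the form "第…条", and it keeps any text before the first article (such as the title) as a separate chunk.

```python
import re

# Assumed heading form: "第" + Chinese or Arabic numerals + "条" at the start of a line.
ARTICLE_HEADING = re.compile(r"^第[零〇一二三四五六七八九十百千\d]+条", re.MULTILINE)


def split_by_article(text: str) -> list[str]:
    """Split a legal document into one chunk per article (hypothetical helper)."""
    starts = [m.start() for m in ARTICLE_HEADING.finditer(text)]
    if not starts:
        # No article headings found: return the whole text as a single chunk.
        stripped = text.strip()
        return [stripped] if stripped else []

    chunks = []
    # Text before the first article (title, preamble) becomes its own chunk.
    preamble = text[:starts[0]].strip()
    if preamble:
        chunks.append(preamble)
    # Each article runs from its heading to the next heading (or end of text).
    for begin, end in zip(starts, starts[1:] + [len(text)]):
        chunks.append(text[begin:end].strip())
    return chunks


# Made-up placeholder text, not taken from any real regulation.
sample = (
    "示例规定\n"
    "第一条 为了示例目的,制定本规定。\n"
    "第二条 本条内容较长,\n"
    "可以跨越多行。\n"
    "第三条 本规定自发布之日起施行。\n"
)

for chunk in split_by_article(sample):
    print(repr(chunk))
```

Unlike fixed-size chunking, each chunk here ends at an article boundary, so a retrieved passage carries one complete provision. A line inside an article that happens to begin with a cross-reference such as "第二条" would be split incorrectly by this pattern, so a production splitter would need a stricter rule.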

Key words: Visual Question Answering, Retrieval-Augmented Generation, Public Security Case-Related Property, Large Language Model, Multimodal Large Language Model
