Technical Framework for Visual Question Answering System of Business Knowledge for Case-Related Property

doi:10.3778/j.issn.1673-9418.2502029

Abstract

The standardized management and lawful disposal of case-related property is a key link in the handling of administrative and criminal cases by public security authorities, and it bears directly on the fairness and procedural legality of law enforcement. In recent years, the relevant departments have issued a series of normative legal documents to guide police officers in this work. However, the procedures for investigation and evidence collection, sealing, seizure, freezing, and receipt differ across types of case-related property, which makes it difficult for officers to understand the provisions accurately and implement them strictly. To help grassroots police officers manage case-related property, this paper proposes a technical framework for visual question answering on case-related property business knowledge. The framework targets three problems: visual question answering models lack business knowledge about case-related property, conventional retrieval-augmented generation has low recall, and model inference performance is poor. A multimodal large language model rewrites and completes the question according to the image, which addresses the problem that direct retrieval with the original question fails to hit the relevant information. Both the knowledge base and the query are vectorized with the Conan-embedding model, which improves knowledge retrieval. A visual question answering dataset and a public security knowledge base on case-related property business knowledge are constructed; for laws and regulations, the conventional fixed-size chunking method is abandoned in favor of dynamic segmentation that stores the text article by article. The LongLLMLingua model compresses the retrieved external knowledge according to the rewritten question, which improves accuracy while effectively reducing context length. Experimental results show that the proposed method reaches an accuracy of 71.98%, an improvement of 18.68 percentage points over using GLM-V directly, and outperforms the other baseline models, which verifies its effectiveness.
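The article-by-article storage mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not describe the actual segmentation procedure, so everything here is an assumption: the function name `split_by_article`, the marker pattern, the requirement that each article begins on its own line, and the sample text, which is invented for illustration and not quoted from any real regulation.

```python
import re
from typing import Dict, List

# Assumption: every article starts on its own line with a marker such as
# "第一条" or "第12条". Markers that appear mid-sentence (cross-references
# to other articles) are deliberately not treated as boundaries.
ARTICLE_START = re.compile(
    r"^[ \t\u3000]*(第[一二三四五六七八九十百千零〇两0-9]+条)",
    re.MULTILINE,
)


def split_by_article(text: str) -> List[Dict[str, str]]:
    """Split a legal document into one chunk per article.

    Text before the first article marker (title, preamble) is ignored
    in this sketch.
    """
    matches = list(ARTICLE_START.finditer(text))
    chunks = []
    for i, m in enumerate(matches):
        # A chunk runs from this marker to the next marker (or end of text).
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        chunks.append({"article": m.group(1), "text": text[m.start():end].strip()})
    return chunks


# Invented sample text, for illustration only.
SAMPLE = (
    "第一条 为规范涉案财物管理，制定本规定。\n"
    "第二条 涉案财物应当依法查封、扣押、冻结。\n"
    "对不宜移动的财物，可以就地封存。\n"
    "第三条 依照本规定第二条扣押的财物，应当及时登记。\n"
)

chunks = split_by_article(SAMPLE)
```

Unlike fixed-size chunking, each chunk here holds one complete article, including continuation lines, so a retrieved chunk is never cut off in the middle of a provision.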

Key words: visual question answering, retrieval augmented generation, people's public security case-related property, large language model, multimodal large language models

XUE Di, LI Xin, JIANG Zhangtao, WANG Xiaoyu, LIU Mingshuai. Technical Framework for Visual Question Answering System of Business Knowledge for Case-Related Property[J]. Journal of Frontiers of Computer Science and Technology, 2025, 19(12): 3267-3278.

薛迪, 李欣, 蒋章涛, 王晓宇, 刘明帅. 面向涉案财物的业务知识视觉问答技术框架[J]. 计算机科学与探索, 2025, 19(12): 3267-3278.
