Research on Case Information Element Extraction Method Based on Instruction Fine-Tuning of Large Language Models

doi:10.3778/j.issn.1673-9418.2412085

Abstract

With the rapid development of artificial intelligence, the strategy of “technology-driven policing” has become an important way to modernize public security work. Under this strategy, public security organs must process large volumes of unstructured case text, and traditional manual processing can no longer meet current requirements. Large language models (LLMs) have strong language understanding and generation capabilities and can automatically extract key information elements from case texts, such as involved personnel, time, location, and case nature, which supports case analysis, evidence collection, and decision-making. This paper studies a case information element extraction method based on instruction fine-tuning of LLMs, aiming to improve the efficiency and accuracy of case information processing in public security organs and to further promote the informatization of public security work. The method strengthens the information extraction capability of LLMs through parameter-efficient fine-tuning with LoRA, instruction fine-tuning, data augmentation, and in-context learning. Experimental results on a self-built case text dataset show a significant performance improvement, with both extraction accuracy and recall exceeding those of traditional methods.
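As a rough illustration of what instruction-style extraction with in-context demonstrations might look like, the following minimal sketch builds a prompt, a supervised training record, and a simple scoring function. It is not the authors' implementation: the element schema, the prompt wording, the record layout (`input`/`output`), and all function names are assumptions introduced here, and precision is computed alongside recall only as the usual pairing for extraction tasks.

```python
import json

# Hypothetical element schema, based on the examples named in the abstract.
ELEMENT_TYPES = ["involved_personnel", "time", "location", "case_nature"]


def build_prompt(case_text, demonstrations=()):
    """Assemble an instruction prompt, optionally with in-context demonstrations.

    `demonstrations` is a sequence of (case_text, elements_dict) pairs.
    The wording of the instruction is an assumption, not taken from the paper.
    """
    instruction = (
        "Extract the following elements and answer with a JSON object whose keys are "
        + ", ".join(ELEMENT_TYPES)
        + ". Each value is a list of strings."
    )
    parts = [instruction]
    for demo_text, demo_elements in demonstrations:
        parts.append("Case text: " + demo_text)
        parts.append("Answer: " + json.dumps(demo_elements, ensure_ascii=False))
    parts.append("Case text: " + case_text)
    parts.append("Answer:")
    return "\n".join(parts)


def build_training_record(case_text, elements):
    """One supervised record: the prompt as input, the gold JSON as target."""
    return {
        "input": build_prompt(case_text),
        "output": json.dumps(elements, ensure_ascii=False),
    }


def precision_recall(predicted, gold):
    """Micro-averaged precision and recall over (element_type, value) pairs."""
    pred_pairs = {(k, v) for k, vs in predicted.items() for v in vs}
    gold_pairs = {(k, v) for k, vs in gold.items() for v in vs}
    correct = len(pred_pairs & gold_pairs)
    precision = correct / len(pred_pairs) if pred_pairs else 0.0
    recall = correct / len(gold_pairs) if gold_pairs else 0.0
    return precision, recall
```

Records of this shape could then be fed to any fine-tuning pipeline, while `precision_recall` compares a parsed model answer against the annotated elements using exact string matching, which is a simplifying assumption.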

Keywords: large language model, information extraction, instruction fine-tuning, police affairs, named entity recognition

WANG Jintao, MENG Qixiang, GAO Zhilin, BU Fanliang. Research on Case Information Element Extraction Method Based on Instruction Fine-Tuning of Large Language Models[J]. Journal of Frontiers of Computer Science and Technology, 2025, 19(8): 2161-2173.

