计算机科学与探索 ›› 2024, Vol. 18 ›› Issue (7): 1911-1922. DOI: 10.3778/j.issn.1673-9418.2304078

• Artificial Intelligence · Pattern Recognition •

Study on Entity Extraction Method for Pharmaceutical Instructions Based on Pretrained Models

CHEN Zhongyong, HUANG Yongsheng, ZHANG Min, JIANG Ming   

  1. Zhejiang Pharmaceutical Information Publicity and Development Service Center, Hangzhou 310061, China
  2. School of Computer Science, Hangzhou Dianzi University, Hangzhou 310018, China
  • Online: 2024-07-01  Published: 2024-06-28

Abstract: Extracting medical entities from drug instructions provides fundamental data for the intelligent retrieval of medication information and for the construction of medical knowledge graphs, and is therefore of notable research significance and practical value. However, medical entities differ considerably across instructions for drugs treating different diseases, so model training ordinarily requires a large number of annotated samples. To address this issue, this research adopts a “large model + small model” design and proposes a partial-label named entity recognition model based on a pretrained model: a pretrained language model fine-tuned on a small number of samples first extracts partial entities from drug instructions, and a Transformer-based partial-label model then refines the extraction results. The partial-label model encodes the input text, the recognized partial entities, and the entity labels with a flat lattice structure, extracts feature representations with a Transformer, and predicts entity labels through a conditional random field (CRF) layer. To reduce the amount of annotated training data, a sample data augmentation method based on an entity-masking strategy over labeled samples is proposed to train the partial-label model. Experiments validate the feasibility of the “large model + small model” approach for medical entity extraction, achieving precision (P), recall (R), and F1 score of 85.0%, 86.1%, and 85.6%, respectively, outperforming other learning methods.
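
The flat lattice encoding is only named in the abstract; the following minimal sketch illustrates how such an input could plausibly be assembled, assuming a FLAT-style layout in which characters and first-stage entity spans share one sequence, each token carrying (head, tail) positions for relative position encoding. The class and function names, the label set, and the example spans are illustrative assumptions, not the paper's implementation.

# Hypothetical sketch of a flat-lattice input: characters and already-
# recognized partial entities are flattened into one token sequence, each
# token carrying its (head, tail) span so a Transformer can attend over
# both granularities. Names and labels here are assumptions.
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class LatticeToken:
    text: str    # a single character or a recognized entity span
    head: int    # start position in the original text
    tail: int    # end position in the original text (inclusive)
    label: str   # partial label from the first-stage model, or "O"

def build_flat_lattice(text: str,
                       partial_entities: List[Tuple[int, int, str]]) -> List[LatticeToken]:
    """Flatten characters plus first-stage entity spans into one token list."""
    # Character-level tokens: head == tail.
    lattice = [LatticeToken(ch, i, i, "O") for i, ch in enumerate(text)]
    # Entity tokens are appended after the characters; their span positions
    # preserve ordering information for relative position encoding.
    for start, end, tag in partial_entities:
        lattice.append(LatticeToken(text[start:end + 1], start, end, tag))
    return lattice

if __name__ == "__main__":
    sent = "本品用于治疗高血压及心绞痛"
    # Spans the fine-tuned pretrained model might have extracted (illustrative).
    partial = [(6, 8, "DISEASE"), (10, 12, "DISEASE")]
    for tok in build_flat_lattice(sent, partial):
        print(tok)

In a downstream step, such lattice tokens would be embedded together with their span positions, passed through the Transformer, and scored per character by the CRF layer.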
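Likewise, here is a minimal sketch of the entity-masking augmentation idea, assuming it generates extra partial-label training samples from one fully labeled sample by hiding random subsets of gold entities, so the partial-label model learns to recover entities the first stage missed. The function name and the keep_prob and n_copies parameters are assumptions, not the paper's settings.

# Minimal sketch of entity-masking data augmentation: from one labeled
# sample, produce several partial-label variants, each keeping only a
# random subset of the gold entities visible. Hidden entities become the
# prediction targets for the partial-label model. Names are assumptions.
import random
from typing import List, Tuple

Entity = Tuple[int, int, str]  # (start, end, tag)

def mask_entities(entities: List[Entity],
                  keep_prob: float = 0.5,
                  n_copies: int = 3,
                  seed: int = 0) -> List[List[Entity]]:
    """Return augmented samples, each keeping a random subset of entities."""
    rng = random.Random(seed)
    augmented = []
    for _ in range(n_copies):
        visible = [e for e in entities if rng.random() < keep_prob]
        augmented.append(visible)
    return augmented

if __name__ == "__main__":
    gold = [(0, 1, "DRUG"), (6, 8, "DISEASE"), (10, 12, "DISEASE")]
    for i, sample in enumerate(mask_entities(gold)):
        print(f"augmented sample {i}: visible entities = {sample}")
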

Key words: named entity recognition (NER), pretrained models, medical entity extraction, Transformer