Journal of Frontiers of Computer Science and Technology, 2024, Vol. 18, Issue (7): 1911-1922. DOI: 10.3778/j.issn.1673-9418.2304078

• Artificial Intelligence·Pattern Recognition •

Study on Entity Extraction Method for Pharmaceutical Instructions Based on Pretrained Models

CHEN Zhongyong, HUANG Yongsheng, ZHANG Min, JIANG Ming   

  1. Zhejiang Pharmaceutical Information Publicity and Development Service Center, Hangzhou 310061, China
  2. School of Computer Science, Hangzhou Dianzi University, Hangzhou 310018, China
  • Online: 2024-07-01 Published: 2024-06-28

Abstract: Extracting medical entities from drug instructions provides fundamental data for the intelligent retrieval of medication information and for the construction of medical knowledge graphs, and therefore has considerable research significance and practical value. However, the medical entities in drug instructions differ substantially between drugs that treat different diseases, so model training requires a large number of annotated samples. To address this issue, this research adopts a “large model + small model” design and proposes a part-label named entity recognition model based on a pre-trained model. A pre-trained language model fine-tuned on a small number of samples first extracts partial entities from the drug instructions, and a Transformer-based part-label model then refines the extraction results. The part-label model encodes the input text, the identified partial entities, and the entity labels using a planar lattice structure, extracts feature representations with a Transformer, and predicts entity labels through a conditional random field (CRF) layer. To reduce the amount of annotated training data required, a sample data augmentation method is proposed that applies an entity masking strategy to labeled samples in order to train the part-label model. Experimental results validate the feasibility of the “large model + small model” approach for medical entity extraction: precision (P), recall (R), and F1 score reach 85.0%, 86.1%, and 85.6%, respectively, outperforming the other learning methods compared.

Key words: named entity recognition (NER), pre-trained models, medical entity extraction, Transformer
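
The entity masking augmentation and the lattice-style input described in the abstract can be illustrated with a minimal Python sketch. The abstract does not give the masking policy, the lattice layout, or the label scheme, so everything here is an assumption made for illustration: the names `Entity`, `bio_tags`, `build_lattice`, and `augment` are hypothetical, the keep probability is arbitrary, BIO tagging stands in for whatever scheme the authors use, and the lattice is modelled as simple `(token, head, tail)` units in the style of a flat lattice. The Transformer encoder and CRF layer are omitted.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Entity:
    start: int  # index of the first character (inclusive)
    end: int    # index of the last character (inclusive)
    label: str  # entity type, e.g. "DRUG" (illustrative label set)


def bio_tags(text, entities):
    """Character-level BIO tags for all gold entities (assumed label scheme)."""
    tags = ["O"] * len(text)
    for e in entities:
        tags[e.start] = f"B-{e.label}"
        for i in range(e.start + 1, e.end + 1):
            tags[i] = f"I-{e.label}"
    return tags


def build_lattice(text, known):
    """Lattice units as (token, head, tail) triples.

    Each character is a unit with head == tail. Each known entity adds two
    units covering its span: the entity text and a label token.
    """
    units = [(ch, i, i) for i, ch in enumerate(text)]
    for e in known:
        units.append((text[e.start:e.end + 1], e.start, e.end))
        units.append((f"<{e.label}>", e.start, e.end))
    return units


def augment(text, entities, n_variants, keep_prob=0.5, seed=0):
    """Create training samples by hiding a random subset of gold entities.

    Entities that are kept play the role of the partial entities found by
    the fine-tuned pre-trained model; hidden ones must be recovered by the
    part-label model, so the target is always the full tag sequence.
    """
    rng = random.Random(seed)
    targets = bio_tags(text, entities)
    samples = []
    for _ in range(n_variants):
        known = [e for e in entities if rng.random() < keep_prob]
        samples.append({"lattice": build_lattice(text, known),
                        "targets": targets})
    return samples


# Example: one labeled sentence expanded into several partially masked samples.
text = "aspirin treats fever"
entities = [Entity(0, 6, "DRUG"), Entity(15, 19, "SYMPTOM")]
samples = augment(text, entities, n_variants=4)
```

Each labeled sentence yields several training samples that share the same target tags but expose different subsets of entities as hints, which is one plausible way a small annotated set could be stretched to train the second-stage model.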
