计算机科学与探索 (Journal of Frontiers of Computer Science and Technology), 2023, Vol. 17, Issue 4: 912-921. DOI: 10.3778/j.issn.1673-9418.2107097

• Artificial Intelligence · Pattern Recognition •

Study on the Application of Transfer Learning to Entity Recognition in Low-Resource Environments

DU Peng (杜鹏), ZHANG Youming (张有明), ZHU Zhengzhou (朱郑州), LI Guocai (李国才)

  1. School of Software and Microelectronics, Peking University, Beijing 102600, China
  • Online: 2023-04-01    Published: 2023-04-17

Abstract: Entity recognition is a fundamental task in information extraction. Recognizing entities effectively in low-resource scenarios that lack sufficient annotated corpora remains a challenge in natural language processing. Combined with a pre-trained model, a "unified encoding, separate decoding" solution is adopted: abstract entity boundary information is learned from a large-scale source domain and, through transfer learning, transferred to low-resource scenarios to improve entity recognition accuracy there. Unlike existing methods, the feature vectors are adapted only before decoding. An adaptive module is designed that decodes each feature vector produced by the unified encoder separately, along the dimensions of the target domain's entity types and annotation schemes, determining the annotation of each entity and avoiding the complex nested-entity problem. Experimental results on public datasets show that, compared with the BERT-BiLSTM-CRF baseline, precision is improved by 4 percentage points, recall by 5.4 percentage points, and F1 by 4.72 percentage points in the low-resource pharmaceutical-domain scenario; in the low-resource personnel-domain scenario, precision is improved by 31.91 percentage points, recall by 31.7 percentage points, and F1 by 31.86 percentage points. Experimental results on independently collected and curated datasets also demonstrate the model's effectiveness for entity recognition in low-resource scenarios, with gains in precision and recall over the Lattice-BERT model.
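
As a concrete illustration of the "unified encoding, separate decoding" idea described above, the sketch below pairs one shared pre-trained encoder with an independent lightweight decoding head per target-domain entity type, so each type is tagged over its own BIO scheme and nested entities of different types do not collide. This is a minimal sketch under assumptions not taken from the paper: a HuggingFace BERT encoder stands in for the unified encoder, plain linear heads stand in for the adaptive decoding module, and the class, entity-type, and model names are illustrative.

```python
# Minimal sketch of a "unified encoding, separate decoding" tagger.
# Assumptions (illustrative, not the authors' implementation): a HuggingFace
# BERT model is the shared encoder, each target-domain entity type gets its
# own linear decoding head over a BIO tag set, and only the heads are adapted
# on the low-resource target domain.
import torch.nn as nn
from transformers import BertModel, BertTokenizerFast

BIO_TAGS = ["O", "B", "I"]  # per-type annotation scheme

class EntitySpecificTagger(nn.Module):
    def __init__(self, entity_types, encoder_name="bert-base-chinese"):
        super().__init__()
        self.encoder = BertModel.from_pretrained(encoder_name)  # unified encoder
        hidden = self.encoder.config.hidden_size
        # one small decoding head per target-domain entity type
        self.heads = nn.ModuleDict(
            {t: nn.Linear(hidden, len(BIO_TAGS)) for t in entity_types}
        )

    def forward(self, input_ids, attention_mask):
        # shared (unified) encoding of the sentence
        h = self.encoder(input_ids=input_ids,
                         attention_mask=attention_mask).last_hidden_state
        # separate decoding: each entity type is tagged independently, so
        # overlapping or nested entities of different types do not conflict
        return {t: head(h) for t, head in self.heads.items()}

# Usage sketch: freeze the encoder (it carries the boundary knowledge learned
# on the large source domain) and train only the per-type heads on target data.
tokenizer = BertTokenizerFast.from_pretrained("bert-base-chinese")
model = EntitySpecificTagger(entity_types=["drug", "disease", "symptom"])
for p in model.encoder.parameters():
    p.requires_grad = False

batch = tokenizer(["阿司匹林可缓解头痛"], return_tensors="pt")
logits = model(batch["input_ids"], batch["attention_mask"])
# logits["drug"] has shape (batch, seq_len, 3): per-token B/I/O scores for "drug"
```

Under this split, transferring to a new low-resource domain only requires adding and training new decoding heads, which mirrors the abstract's point that adaptation happens only on the feature vectors before decoding.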

Key words: transfer learning, entity recognition, low-resource scenario, sequence labeling