Journal of Frontiers of Computer Science and Technology (计算机科学与探索) ›› 2023, Vol. 17 ›› Issue (6): 1395-1404. DOI: 10.3778/j.issn.1673-9418.2203129

• Artificial Intelligence · Pattern Recognition •

Named Entity Recognition of Wheat Diseases and Pests Fusing ALBERT and Rules

LIU Hebing, ZHANG Demeng, XIONG Shufeng, MA Xinming, XI Lei   

  1. College of Information and Management Sciences, Henan Agriculture University, Zhengzhou 450046, China
  2. Henan Engineering Laboratory of Farmland Monitoring and Control, Zhengzhou 450002, China
  • Online: 2023-06-01 Published: 2023-06-01

Abstract: Chinese named entity recognition for wheat diseases and pests is a key step in building a knowledge graph for this domain. To address the scarcity of training data, complex entity structures, diverse entity types, and uneven entity distribution in this domain, two data augmentation methods are used, on the premise of fully mining implicit knowledge, to enrich sentence-level semantic information. A corpus, WpdCNER (wheat pests and diseases Chinese named entity recognition), and a domain lexicon, WpdDict (wheat pests and diseases dictionary), are constructed, and 16 entity categories are defined under the guidance of domain experts. A rule-amended Chinese named entity recognition model, WPD-RA (wheat pests and disease-rules amendment model), is also proposed. The model performs entity recognition with ALBERT+BiLSTM+CRF (a lite bidirectional encoder representations from transformers + bidirectional long short-term memory + conditional random field) and then applies specifically defined rules to correct entity boundaries. WPD-RA achieves the best results, with 94.72% precision, 95.23% recall, and an F1 score of 94.97%; compared with the same model without rules, precision, recall, and F1 improve by 1.71, 0.34, and 1.03 percentage points, respectively. Experimental results show that the method effectively recognizes named entities in the wheat diseases and pests domain and outperforms the other models compared. It also offers a reference approach for named entity recognition in other domains such as food safety and biology.

Key words: wheat diseases and pests, data augmentation, named entity recognition (NER), ALBERT, rules amendment
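
The metrics reported in the abstract can be cross-checked with a few lines of arithmetic. The sketch below assumes that F1 is the usual harmonic mean of precision and recall, which the abstract does not state explicitly. The figures for the model without rules are not given in the abstract; they are derived here by subtracting the reported gains. The helper name `f1_score` and the dictionary names are illustrative only.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (same units as the inputs)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Figures reported in the abstract for WPD-RA (with rules), in percent.
with_rules = {"precision": 94.72, "recall": 95.23, "f1": 94.97}

# Reported gains over the model without rules, in percentage points.
gains = {"precision": 1.71, "recall": 0.34, "f1": 1.03}

# Figures for the model without rules, implied by subtracting the gains
# (an inference from the abstract, not values quoted from it).
without_rules = {k: round(with_rules[k] - gains[k], 2) for k in with_rules}

for name, m in (("with rules", with_rules), ("without rules", without_rules)):
    recomputed = f1_score(m["precision"], m["recall"])
    print(f"{name}: reported F1 = {m['f1']:.2f}, recomputed F1 = {recomputed:.2f}")
```

Under these assumptions the recomputed F1 agrees with the reported value to two decimal places in both cases (94.97 with rules, 93.94 without), so the three reported gains are mutually consistent.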