Journal of Frontiers of Computer Science and Technology, 2018, Vol. 12, Issue (6): 898-907. DOI: 10.3778/j.issn.1673-9418.1709045

Research of Medical Named Entity Recognition Based on Internet Resources

TIAN Jiayuan1, YANG Donghua1,2+, WANG Hongzhi1   

  1. School of Computer Science and Technology, Harbin Institute of Technology, Harbin 150001, China
  2. Academy of Fundamental and Interdisciplinary Sciences, Harbin Institute of Technology, Harbin 150001, China
  • Online: 2018-06-01 Published: 2018-06-06

Abstract: Named entity recognition is the first step of medical information extraction, but the lack of open medical corpora makes it difficult. Existing work mostly relies on a small amount of manually annotated text and therefore does not generalize well. The Internet, as a large collection of data, can serve as a source for extracting medical knowledge. Because Internet resources are large in scale, loosely structured, and unannotated, this paper proposes an iterative framework to exploit them. To mitigate the accuracy drop caused by domain differences, the text is annotated with a method that fuses a general-purpose model and a domain dictionary. To avoid rebuilding the whole model in each iteration, the model is built with an online method. The named entity recognition model integrates multiple features, including lexical features, affix features, and word-length features, to improve its recognition ability. In addition, this paper gives a heuristic model compression method to enhance the usability of the model. The experimental results show that the proposed strategies are effective.

Key words: named entity recognition, Internet resources, iterative framework, averaged perceptron
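
As an illustration of the online learning idea named in the abstract and key words, the following is a minimal sketch of an averaged perceptron tagger that is updated one token at a time, so new text can be added without retraining from scratch. It is not the authors' implementation: the class and function names, the tag set, and the feature templates (word identity, two-character affixes, capped word length, previous tag) are assumptions chosen only to mirror the feature types the abstract lists.

```python
from collections import defaultdict


def token_features(tokens, i, prev_tag):
    """Hypothetical feature template: lexical, affix, and word-length features."""
    w = tokens[i]
    return [
        "bias",
        "word=" + w,
        "prefix=" + w[:2],
        "suffix=" + w[-2:],
        "len=" + str(min(len(w), 5)),   # word length, capped at 5
        "prev_tag=" + prev_tag,
    ]


class AveragedPerceptron:
    """Multiclass perceptron with lazily accumulated weight averages."""

    def __init__(self, tags):
        self.tags = list(tags)
        self.weights = defaultdict(float)  # (feature, tag) -> current weight
        self._totals = defaultdict(float)  # weight summed over past steps
        self._stamps = defaultdict(int)    # step of each weight's last change
        self.step = 0

    def predict(self, feats, weights=None):
        w = self.weights if weights is None else weights
        # Ties are broken in favour of the earlier tag in self.tags.
        return max(self.tags,
                   key=lambda t: sum(w.get((f, t), 0.0) for f in feats))

    def _bump(self, key, delta):
        # Credit the old value for the steps it stayed unchanged, then change it.
        self._totals[key] += (self.step - self._stamps[key]) * self.weights[key]
        self._stamps[key] = self.step
        self.weights[key] += delta

    def update(self, feats, gold):
        """One online step: predict, and correct the weights on a mistake."""
        self.step += 1
        guess = self.predict(feats)
        if guess != gold:
            for f in feats:
                self._bump((f, gold), 1.0)
                self._bump((f, guess), -1.0)
        return guess

    def averaged(self):
        """Return the averaged weights as a plain dict, dropping zero entries."""
        avg = {}
        for key, w in self.weights.items():
            total = self._totals[key] + (self.step - self._stamps[key]) * w
            if total:
                avg[key] = total / self.step
        return avg


def train_online(model, sentences):
    """Feed (tokens, tags) pairs to the model one token at a time."""
    for tokens, tags in sentences:
        prev = "<s>"
        for i, gold in enumerate(tags):
            model.update(token_features(tokens, i, prev), gold)
            prev = gold


def tag(model, tokens, weights=None):
    """Greedy left-to-right tagging; pass model.averaged() to use averages."""
    prev, out = "<s>", []
    for i in range(len(tokens)):
        prev = model.predict(token_features(tokens, i, prev), weights)
        out.append(prev)
    return out
```

Because `update` touches only the weights of the features it sees, newly annotated sentences from a later iteration can be passed to `train_online` on the same model object. Dropping zero entries in `averaged` is only a trivial size reduction and is not the heuristic compression method the paper describes.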