Research of Medical Named Entity Recognition Based on Internet Resources

Journal of Frontiers of Computer Science and Technology ›› 2018, Vol. 12 ›› Issue (6): 898-907. DOI: 10.3778/j.issn.1673-9418.1709045

TIAN Jiayuan1, YANG Donghua1,2+, WANG Hongzhi1

1. School of Computer Science and Technology, Harbin Institute of Technology, Harbin 150001, China
2. Academy of Fundamental and Interdisciplinary Sciences, Harbin Institute of Technology, Harbin 150001, China

Online: 2018-06-01 Published: 2018-06-06

Abstract

Named entity recognition is the first step of medical information extraction, but the lack of public medical corpora makes it difficult. Existing work mostly relies on a small amount of manually annotated text and therefore does not generalize well. The Internet gathers large amounts of data from which medical knowledge can be extracted. Because Internet resources are large in scale, weakly structured, and unannotated, this paper proposes an iterative framework to exploit them. Text is annotated by fusing a general-purpose model with a domain dictionary, which mitigates the loss of precision caused by domain differences. The model is built with an online method, which avoids rebuilding the whole model in each iteration. The named entity recognition model integrates lexical features, affix features, word length features, and others, which improves its recognition ability. A heuristic model compression method is also proposed to improve the usability of the model. Experimental results show that the proposed strategies are effective.

Key words: named entity recognition, Internet resources, iterative framework, averaged perceptron
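The abstract names an averaged perceptron trained online with lexical, affix, and word length features, followed by a heuristic compression step. The sketch below shows one way these pieces could fit together. The feature template, the `DRUG`/`O` labels, the toy sentences, and the threshold-based pruning in `averaged(min_abs=...)` are illustrative assumptions, not the authors' implementation.

```python
from collections import defaultdict


def token_features(tokens, i):
    """Hypothetical feature template: word form, context, affixes, word length."""
    w = tokens[i]
    prev_w = tokens[i - 1] if i > 0 else "<s>"
    return [
        "bias",
        "word=" + w,
        "prev=" + prev_w,
        "prefix=" + w[:1],
        "suffix=" + w[-1:],
        "len=" + str(min(len(w), 5)),
    ]


class AveragedPerceptron:
    """Multiclass perceptron with lazily accumulated weight averages."""

    def __init__(self, labels):
        self.labels = list(labels)
        self.weights = defaultdict(float)  # (feature, label) -> current weight
        self._totals = defaultdict(float)  # weight summed over past steps
        self._stamps = defaultdict(int)    # step at which a weight last changed
        self.step = 0

    def predict(self, feats, weights=None):
        w = self.weights if weights is None else weights
        scores = {y: sum(w.get((f, y), 0.0) for f in feats) for y in self.labels}
        return max(self.labels, key=lambda y: scores[y])

    def _bump(self, key, delta):
        # Credit the old value for the steps it was in force, then change it.
        self._totals[key] += (self.step - self._stamps[key]) * self.weights[key]
        self._stamps[key] = self.step
        self.weights[key] += delta

    def update(self, feats, gold):
        """One online step: only the touched weights change, no retraining."""
        guess = self.predict(feats)
        if guess != gold:
            for f in feats:
                self._bump((f, gold), 1.0)
                self._bump((f, guess), -1.0)
        self.step += 1

    def averaged(self, min_abs=0.0):
        """Averaged weights; entries with |weight| <= min_abs are dropped.

        The threshold is an assumed stand-in for a compression heuristic.
        """
        out = {}
        for key, w in self.weights.items():
            total = self._totals[key] + (self.step - self._stamps[key]) * w
            avg = total / max(self.step, 1)
            if abs(avg) > min_abs:
                out[key] = avg
        return out


def train(model, sentences, epochs=5):
    """Stream (tokens, tags) pairs through the model one token at a time."""
    for _ in range(epochs):
        for tokens, tags in sentences:
            for i, gold in enumerate(tags):
                model.update(token_features(tokens, i), gold)


# Toy, made-up data standing in for automatically annotated web text.
toy = [
    (["take", "aspirin", "daily"], ["O", "DRUG", "O"]),
    (["ibuprofen", "helps"], ["DRUG", "O"]),
]
model = AveragedPerceptron(["O", "DRUG"])
train(model, toy)
compact = model.averaged(min_abs=0.1)
```

Because `update` touches only the weights of the current token's features, newly annotated text from a later iteration can be streamed through the same model object without rebuilding it. Passing `compact` as the `weights` argument of `predict` uses the pruned, averaged model.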

TIAN Jiayuan, YANG Donghua, WANG Hongzhi. Research of Medical Named Entity Recognition Based on Internet Resources[J]. Journal of Frontiers of Computer Science and Technology, 2018, 12(6): 898-907.

