Journal of Frontiers of Computer Science and Technology, 2017, Vol. 11, Issue (7): 1044-1055. DOI: 10.3778/j.issn.1673-9418.1607015

Named Entity Recognition Optimization on DBpedia Spotlight

FU Yuxin1,2, WANG Xin1,2+, FENG Zhiyong2,3, XU Qiang1,2   

  1. School of Computer Science and Technology, Tianjin University, Tianjin 300354, China
  2. Tianjin Key Laboratory of Cognitive Computing and Application, Tianjin 300354, China
  3. School of Computer Software, Tianjin University, Tianjin 300354, China
  • Online: 2017-07-01 Published: 2017-07-07

Named Entity Recognition Optimization on DBpedia Spotlight

付宇新1,2,王  鑫1,2+,冯志勇2,3,徐  强1,2   

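  1. School of Computer Science and Technology, Tianjin University, Tianjin 300354
  2. Tianjin Key Laboratory of Cognitive Computing and Application, Tianjin 300354
  3. School of Computer Software, Tianjin University, Tianjin 300354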

Abstract: Named entity recognition bridges the gap between knowledge bases and natural language, and supports research in keyword extraction, machine translation, topic detection and tracking, and related areas. Based on an analysis of current research in named entity recognition, this paper proposes a general-purpose optimization scheme for the task. First, it designs and implements an incremental extension method that uses a candidate set to reduce dependency on the training set. Second, it applies the pointwise mutual information ratio to select features from entity contexts, which significantly reduces the context space while improving annotation performance. Finally, it presents a secondary disambiguation method based on topic vectors, which further improves annotation precision. Extensive comparison experiments on the widely used open-source named entity recognition system DBpedia Spotlight verify that the proposed scheme outperforms state-of-the-art methods.
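To make the second step more concrete, the following is a minimal sketch of context feature selection using plain pointwise mutual information, PMI(e, w) = log2(P(e, w) / (P(e) P(w))). It is an illustration under stated assumptions, not the paper's method: the abstract does not define the "pointwise mutual information ratio", so standard PMI is substituted here, and the function names, the `threshold` parameter, and the toy entity and word labels are all hypothetical.

```python
import math
from collections import Counter


def pmi_scores(observations):
    """Compute standard PMI for each (entity, context_word) pair.

    `observations` is an iterable of (entity, context_word) co-occurrences.
    Assumption: plain PMI stands in for the paper's PMI ratio, which the
    abstract does not define.
    """
    pair_counts = Counter(observations)
    entity_counts = Counter()
    word_counts = Counter()
    for (entity, word), n in pair_counts.items():
        entity_counts[entity] += n
        word_counts[word] += n
    total = sum(pair_counts.values())

    scores = {}
    for (entity, word), n in pair_counts.items():
        p_joint = n / total
        p_entity = entity_counts[entity] / total
        p_word = word_counts[word] / total
        scores[(entity, word)] = math.log2(p_joint / (p_entity * p_word))
    return scores


def select_context(observations, threshold=0.0):
    """Keep, per entity, only context words whose PMI exceeds `threshold`."""
    kept = {}
    for (entity, word), score in pmi_scores(observations).items():
        if score > threshold:
            kept.setdefault(entity, set()).add(word)
    return kept


# Hypothetical toy data: two entities sharing the surface form "Apple".
observations = [
    ("Apple_Inc.", "iphone"), ("Apple_Inc.", "iphone"), ("Apple_Inc.", "the"),
    ("Apple_(fruit)", "pie"), ("Apple_(fruit)", "pie"), ("Apple_(fruit)", "the"),
]
print(select_context(observations, threshold=0.5))
```

In this toy input, the word "the" co-occurs equally with both entities, so it carries no discriminative signal and is filtered out, while "iphone" and "pie" are retained. This is the sense in which such a filter shrinks the context space.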

Key words: named entity recognition, linked data, DBpedia Spotlight

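Abstract: The named entity recognition task can build a bridge between knowledge bases and natural language, and supports research in keyword extraction, machine translation, topic detection and tracking, and related areas. By analyzing current research in the field of named entity recognition, this paper proposes a general-purpose optimization scheme for named entity recognition. First, an incremental extension method using a candidate set is designed and implemented, reducing dependency on the training set. Second, feature selection is performed on entity contexts using the pointwise mutual information ratio, which greatly reduces the context space while improving annotation performance. Finally, a secondary disambiguation method based on topic vectors is proposed, which further improves annotation precision. A range of comparison experiments on the widely used open-source named entity recognition system DBpedia Spotlight verifies that the proposed optimization scheme achieves better performance than existing systems.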

Key words: named entity recognition, linked data, DBpedia Spotlight