Journal of Frontiers of Computer Science and Technology, 2015, Vol. 9, Issue (10): 1238-1246. DOI: 10.3778/j.issn.1673-9418.1409019

Online Encyclopedia Entities Tagging Method Based on Page Structure and Content

LI Xiaojing1,2+, LIN Hailun1,2, JIA Yantao1, WANG Yuanzhuo1, CHENG Xueqi1   

  1. Key Laboratory of Network Data Science and Technology, Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190, China
    2. University of Chinese Academy of Sciences, Beijing 100049, China
  • Online: 2015-10-01 Published: 2015-09-29

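Online Encyclopedia Entity Tagging Method Fusing Page Structure and Content

LI Xiaojing (李晓静)1,2+, LIN Hailun (林海伦)1,2, JIA Yantao (贾岩涛)1, WANG Yuanzhuo (王元卓)1, CHENG Xueqi (程学旗)1

  1. Key Laboratory of Network Data Science and Technology, Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190
    2. University of Chinese Academy of Sciences, Beijing 100049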

Abstract: Online encyclopedia entity tagging aims to label online encyclopedia pages with standard named entity tags such as person, location, and organization. It is crucial for a wide range of applications, including entity disambiguation, entity relation inference, and knowledge base construction. Features of encyclopedia pages can be divided into structure features (e.g., Infobox, title, and category) and content features (i.e., page content). Existing methods either use only one kind of feature or simply combine both kinds in a single classifier, which yields low F1 values because the differences between the two kinds of features are not fully exploited. This paper presents an online encyclopedia entity tagging method based on page structure and content. The method first builds one classifier for each of the two kinds of features, and then builds a new classifier as a linear combination of the two, so that entities can be tagged more accurately. Experimental results show that the method improves the F1 value over the baseline methods on the task of encyclopedia entity tagging.

Key words: entity tagging, online encyclopedia, named entity, entity classification
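
The linear combination step described in the abstract can be illustrated with a minimal sketch. It assumes that each of the two classifiers (structure-based and content-based) outputs a probability for every tag, and that a single mixing weight in [0, 1] balances them. The function names, the label set, and the `weight` parameter are hypothetical and chosen for illustration only; the abstract does not specify how the paper weights or trains the classifiers.

```python
from typing import Dict

# Hypothetical label set, taken from the tags named in the abstract.
LABELS = ("person", "location", "organization")


def combine_scores(
    structure_probs: Dict[str, float],
    content_probs: Dict[str, float],
    weight: float = 0.5,
) -> Dict[str, float]:
    """Linearly combine two per-label probability distributions.

    `weight` is the share given to the structure-based classifier;
    the content-based classifier receives `1 - weight`. This weighting
    scheme is an assumption, not the paper's documented formula.
    """
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must lie in [0, 1]")
    return {
        label: weight * structure_probs[label]
        + (1.0 - weight) * content_probs[label]
        for label in LABELS
    }


def predict_tag(
    structure_probs: Dict[str, float],
    content_probs: Dict[str, float],
    weight: float = 0.5,
) -> str:
    """Return the label with the highest combined score."""
    combined = combine_scores(structure_probs, content_probs, weight)
    return max(combined, key=combined.get)


# Made-up scores for one page: the two classifiers disagree.
structure = {"person": 0.7, "location": 0.2, "organization": 0.1}
content = {"person": 0.3, "location": 0.6, "organization": 0.1}

print(predict_tag(structure, content, weight=0.7))
print(predict_tag(structure, content, weight=0.2))
```

With these made-up scores, a weight that favors the structure classifier selects `person`, while a weight that favors the content classifier selects `location`, which shows how the mixing weight decides between the two feature views when they disagree.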

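Abstract: Online encyclopedia entity tagging aims to tag entities that belong to specific categories (such as person, location, and organization names). It is important for many applications, such as entity disambiguation, entity relation mining, and knowledge base construction. Features of encyclopedia entities can be divided into structure features (infobox, title, category, etc.) and content features (page body text). Most existing tagging methods consider only one kind of feature or a single classifier, which leads to low F1 values and fails to exploit the strengths of both kinds of features. This paper therefore proposes an online encyclopedia entity tagging method that fuses page structure features and content features. The method takes into account the influence of both kinds of features on the tagging result, builds a separate classifier for each, and linearly combines their results, so that encyclopedia entities can be tagged more accurately. Experiments show that the method achieves a higher F1 value in entity tagging than all the other compared methods.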

Key words: entity tagging, online encyclopedia, named entity, entity classification