Journal of Frontiers of Computer Science and Technology, 2010, Vol. 4, Issue (7): 599-607. DOI: 10.3778/j.issn.1673-9418.2010.07.003

• Academic Research •

A Duplicate Web Entity Identification Approach Based on Iterative Training*

LIU Wei+; XIAO Jianguo

  

  1. Institute of Computer Science & Technology, Peking University, Beijing 100871, China
  • Received: 1900-01-01 Revised: 1900-01-01 Online: 2010-07-14 Published: 2010-07-14
  • Contact: LIU Wei

Chinese title (translated): Research on Duplicate Entity Identification in a Multi-Web-Data-Source Environment*

刘 伟+; 肖建国

  

  1. Institute of Computer Science and Technology, Peking University, Beijing 100871
  • Corresponding author: 刘 伟

Abstract: The large number of Web data sources accessible online makes it convenient for users to obtain the information they need. As a necessary step in Web data integration, duplicate Web entities, which appear in varying representations across data sources, must be identified accurately. To the best of our knowledge, previous work addresses this problem only between two data sources, and the sheer number of Web data sources makes these approaches impractical. To this end, an iterative-training-based approach to duplicate Web entity identification is proposed that can be applied to multiple Web data sources using a small training set. Extensive experiments in the book and computer domains validate the effectiveness of the proposed approach.

Key words: Web entity, duplicate entity identification, Web data integration, iterative training

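Abstract (Chinese version): The large number of accessible data sources on the Web makes it very convenient for people to obtain useful information. As a necessary step in integrating Web data sources, duplicate Web entities that appear in different forms across data sources need to be identified accurately. Existing work on duplicate entity identification operates mainly between two data sources. Because Web data sources are so numerous, these methods cannot be applied to duplicate entity identification across multiple Web data sources. To address this problem, an iterative-training-based method for identifying duplicate Web entities is proposed, which achieves duplicate entity identification across multiple Web data sources with a relatively small training sample. Extensive experiments on multiple Web data sources in two different domains, books and computer products, demonstrate the effectiveness of the proposed method.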

Key words (Chinese version): Web entity, duplicate entity identification, Web data integration, iterative training
