Journal of Frontiers of Computer Science and Technology ›› 2012, Vol. 6 ›› Issue (11): 974-984. DOI: 10.3778/j.issn.1673-9418.2012.11.002

• Academic Research •

Time-Centered Collective Entity Resolution Strategy in Dataspace

YANG Dan1,2+, SHEN Derong1, YU Ge1, NIE Tiezheng1, KOU Yue1

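  1. School of Information Science and Engineering, Northeastern University, Shenyang 110004, China
  2. School of Software, University of Science and Technology Liaoning, Anshan, Liaoning 114051, China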
  • Online: 2012-11-01 Published: 2012-11-02

Time-Centered Collective Entity Resolution Strategy in Dataspace

YANG Dan1,2+, SHEN Derong1, YU Ge1, NIE Tiezheng1, KOU Yue1

  1. School of Information Science and Engineering, Northeastern University, Shenyang 110004, China
  2. School of Software, University of Science and Technology Liaoning, Anshan, Liaoning 114051, China
  • Online: 2012-11-01 Published: 2012-11-02

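Abstract: A dataspace is a heterogeneous environment in which both data and schemas evolve over time. Existing entity resolution techniques rarely consider the role that temporal information plays in resolution, and they do not account for the way entities evolve over time. For resolving entities that carry temporal information in a dataspace, this paper proposes a four-stage time-centered collective entity resolution strategy (T-CER). T-CER takes temporal information into account at the different stages of the resolution process: in the resolution stage it introduces a time-based clustering algorithm (T-Clustering), and it checks the resolution results against time-based constraints to obtain more accurate results. Extensive experiments on real data sets demonstrate the feasibility and effectiveness of T-CER.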

Keywords: dataspace, collective entity resolution, temporal information

Abstract: A dataspace is a heterogeneous environment in which both the data and the schema evolve over time. Existing entity resolution (ER) techniques seldom consider the role that temporal information plays in the ER process, and they do not account for the evolution of entities over time. To address ER with temporal information in a dataspace, this paper proposes a four-stage time-centered collective entity resolution (T-CER) strategy. T-CER considers temporal information at each stage of the ER process: it introduces a time-based clustering (T-Clustering) algorithm in the resolution stage and applies time-based constraint checking to make the ER results more accurate. Extensive experiments on real-world data sets show the effectiveness and correctness of T-CER.

Key words: dataspace, collective entity resolution, temporal information