Journal of Frontiers of Computer Science and Technology, 2013, Vol. 7, Issue (9): 819-830. DOI: 10.3778/j.issn.1673-9418.1305044


Data-Oriented Method of Schema Matching Utilizing Information Theory

ZHAO Chenlu+, SHEN Derong, KOU Yue, NIE Tiezheng, YU Ge   

  1. College of Information Science and Engineering, Northeastern University, Shenyang 110004, China
  • Online: 2013-09-01 Published: 2013-09-04

Data-Oriented Schema Matching Method Applying Information Theory

赵晨露+,申德荣,寇  月,聂铁铮,于  戈   

  1. College of Information Science and Engineering, Northeastern University, Shenyang 110004

Abstract: The growth of computer networks has produced larger and more heterogeneous datasets. To make use of such heterogeneous data, data integration is commonly applied, and schema matching is its core technology. However, these datasets are typically heterogeneous and may suffer from duplicate records, missing values, or missing schema information, which makes traditional schema matching techniques inapplicable. To this end, this paper studies schema matching when the schema information is unknown or incomplete, and proposes a schema matching model based on information theory. The model relies entirely on the characteristics of the data distribution in the datasets and assumes no external knowledge. It computes the similarities between attribute columns accurately, and describes both the data distribution of each attribute column and the relationships among columns. The paper also presents algorithms for constructing the original data distribution graph and the evolutive data distribution graph, which formally express the relationships between attribute columns. A comprehensive experimental evaluation on real datasets verifies the feasibility and effectiveness of the proposed method.

Key words: schema matching, data-oriented, information theory model
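
As a rough illustration of the information-theoretic quantities that a data-driven approach like the one summarized above could build on, the following Python sketch computes the Shannon entropy of a column and the mutual information between two columns of the same table. It is not the paper's algorithm: the abstract gives no formulas, and the function names `entropy` and `mutual_information`, as well as the toy table, are assumptions made only for illustration.

```python
import math
from collections import Counter


def entropy(values):
    """Shannon entropy (in bits) of the empirical value distribution of one column."""
    n = len(values)
    counts = Counter(values)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def mutual_information(xs, ys):
    """Mutual information (in bits) between two columns aligned row by row."""
    if len(xs) != len(ys):
        raise ValueError("columns must have the same number of rows")
    n = len(xs)
    px = Counter(xs)            # marginal counts of the first column
    py = Counter(ys)            # marginal counts of the second column
    pxy = Counter(zip(xs, ys))  # joint counts of value pairs
    # sum over observed pairs of p(x,y) * log2( p(x,y) / (p(x) * p(y)) )
    return sum(
        (c / n) * math.log2(c * n / (px[x] * py[y]))
        for (x, y), c in pxy.items()
    )


# Hypothetical toy table with opaque column names and values.
table = {
    "c1": ["a", "b", "a", "b"],
    "c2": ["a", "b", "a", "b"],  # fully determined by c1
    "c3": ["x", "x", "y", "y"],  # independent of c1 in this sample
}

for name, column in table.items():
    print(name, "entropy:", entropy(column))
print("I(c1; c2):", mutual_information(table["c1"], table["c2"]))
print("I(c1; c3):", mutual_information(table["c1"], table["c3"]))
```

Both quantities depend only on how values are distributed, not on column names or on what the values mean, which is why statistics of this kind remain usable when schema information is missing. How the paper turns such statistics into its data distribution graphs is not stated in the abstract.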

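Abstract (translated from the Chinese): With the development of computer networks, many large and complex heterogeneous datasets have emerged. To use these heterogeneous data effectively, data integration is usually adopted, and schema matching is its core technology. However, many datasets are typically heterogeneous and may contain duplicate data, missing data, or missing schema information, so traditional schema matching techniques cannot be applied. This paper therefore studies the schema matching problem when schema information is unknown or incomplete, and proposes a schema matching model that applies information theory. The model is based entirely on the characteristics of the data distribution and does not depend on any external knowledge. It can accurately compute the similarity between attribute columns and effectively describe both the data distribution of each attribute column in a dataset and the associations between attribute columns. Algorithms for constructing the original data distribution graph and the evolutive data distribution graph are also proposed, which formally express the relationships between attribute columns and thereby achieve matching. A comprehensive experimental evaluation on real datasets demonstrates the feasibility and effectiveness of the method.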

Key words (translated from the Chinese): schema matching, data-oriented, information theory model