Journal of Frontiers of Computer Science and Technology ›› 2017, Vol. 11 ›› Issue (9): 1389-1397. DOI: 10.3778/j.issn.1673-9418.1609004

• Database Technology •

Using Ordered Mutual Information to Match Schema with Opaque Column Names and Data Values

GUO Lele+ (郭乐乐), LIN Youfang (林友芳), HAN Sheng (韩升)

  1. Beijing Key Laboratory of Traffic Data Analysis and Mining, School of Computer and Information Technology, Beijing Jiaotong University, Beijing 100044, China
  • Online: 2017-09-01 Published: 2017-09-06

Abstract: Schema matching is the core step in merging data from heterogeneous sources and a key problem in data integration. Many schema matching methods have been proposed, but a large share of them depend heavily on schema description information, so they lack generality and are hard to apply in other scenarios. To address this, this paper proposes a schema matching method based on ordered mutual information for schemas whose column names and data values are opaque. The method does not rely on schema description information such as column names, column types, or primary and foreign key dependencies, which makes it broadly applicable. Experiments on multiple datasets indicate that the method substantially reduces matching time while also improving the accuracy of the matching results.

Key words: schema matching, opaque conditions, mutual information, undirected graph matching
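
As a rough illustration of the quantity named in the abstract, the sketch below computes plain empirical mutual information between the columns of a single table. It is not the paper's "ordered mutual information", whose definition is not given in this record, and the function names `mutual_information` and `dependency_matrix` are hypothetical. The point it shows is only that the measure depends on how values co-occur across rows, not on column names, types, or the literal values.

```python
from collections import Counter
from math import log2


def mutual_information(xs, ys):
    """Empirical mutual information, in bits, between two equal-length columns."""
    if len(xs) != len(ys):
        raise ValueError("columns must have the same number of rows")
    n = len(xs)
    if n == 0:
        return 0.0
    count_x = Counter(xs)
    count_y = Counter(ys)
    count_xy = Counter(zip(xs, ys))
    mi = 0.0
    for (x, y), c in count_xy.items():
        # p(x, y) * log2(p(x, y) / (p(x) * p(y))), written with raw counts
        mi += (c / n) * log2(c * n / (count_x[x] * count_y[y]))
    return mi


def dependency_matrix(columns):
    """Pairwise mutual information for every pair of columns in one table."""
    k = len(columns)
    return [
        [mutual_information(columns[i], columns[j]) for j in range(k)]
        for i in range(k)
    ]


# Hypothetical toy table: column b is a relabelling of a, column c is independent of a.
a = [0, 0, 1, 1]
b = ["p", "p", "q", "q"]
c = [0, 1, 0, 1]
matrix = dependency_matrix([a, b, c])
```

In this toy table, `a` and `b` share one bit of information even though their values look unrelated, while `a` and `c` share none. A matrix like this can be read as a weighted undirected graph over columns, which is presumably what the "undirected graph matching" keyword refers to, though the matching step itself is outside this sketch.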