Using Ordered Mutual Information to Match Schema with Opaque Column Names and Data Values

doi:10.3778/j.issn.1673-9418.1609004

Journal of Frontiers of Computer Science and Technology, 2017, Vol. 11, Issue (9): 1389-1397.

GUO Lele+, LIN Youfang, HAN Sheng

Beijing Key Laboratory of Traffic Data Analysis and Mining, School of Computer and Information Technology, Beijing Jiaotong University, Beijing 100044, China

Online: 2017-09-01 Published: 2017-09-06

Abstract

Abstract: Schema matching is the core step in merging data from heterogeneous sources and a key problem in data integration. Many schema matching methods have been proposed, but a large share of them depend heavily on schema description information, so they lack generality and are hard to apply in other scenarios. To address this, this paper proposes a schema matching method based on ordered mutual information for schemas whose column names and data values are both opaque. The method does not rely on schema description information such as column names, column types, or primary and foreign key dependencies, which makes it broadly applicable. Experiments on multiple datasets indicate that the method substantially reduces matching time while improving the accuracy of the matching results.

Key words: schema matching, opaque conditions, mutual information, undirected graph matching
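To illustrate why mutual information suits opaque columns, the following is a minimal sketch, not the paper's ordered mutual information method, whose details are not given in this abstract. It computes plain empirical mutual information between two columns from value co-occurrence counts. The function names `mutual_information` and `dependency_matrix` and the toy columns are assumptions made for illustration only.

```python
from collections import Counter
from math import log2


def mutual_information(xs, ys):
    """Empirical mutual information (in bits) between two equal-length columns.

    Values are treated as arbitrary symbols, so renaming them (an opaque
    encoding) does not change the result.
    """
    if len(xs) != len(ys):
        raise ValueError("columns must have the same number of rows")
    n = len(xs)
    if n == 0:
        return 0.0
    count_x = Counter(xs)
    count_y = Counter(ys)
    count_xy = Counter(zip(xs, ys))
    mi = 0.0
    for (x, y), c in count_xy.items():
        # p(x, y) * log2(p(x, y) / (p(x) * p(y))), written with raw counts
        mi += (c / n) * log2(c * n / (count_x[x] * count_y[y]))
    return mi


def dependency_matrix(columns):
    """Pairwise mutual information for a list of columns (hypothetical helper).

    The diagonal entry [i][i] equals the entropy of column i.
    """
    k = len(columns)
    return [
        [mutual_information(columns[i], columns[j]) for j in range(k)]
        for i in range(k)
    ]
```

Because the computation only looks at how often values co-occur, a column re-encoded with meaningless tokens yields the same scores as the original. A matrix of such pairwise scores is one plausible input for a graph-matching step between two schemas, though how the paper builds and matches its graphs is not described here.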

GUO Lele, LIN Youfang, HAN Sheng. Using Ordered Mutual Information to Match Schema with Opaque Column Names and Data Values[J]. Journal of Frontiers of Computer Science and Technology, 2017, 11(9): 1389-1397.

