Data-Oriented Method of Schema Matching Utilizing Information Theory

doi:10.3778/j.issn.1673-9418.1305044

Journal of Frontiers of Computer Science and Technology, 2013, Vol. 7, Issue (9): 819-830. DOI: 10.3778/j.issn.1673-9418.1305044

ZHAO Chenlu+, SHEN Derong, KOU Yue, NIE Tiezheng, YU Ge

College of Information Science and Engineering, Northeastern University, Shenyang 110004, China

Online: 2013-09-01 Published: 2013-09-04

Abstract

Abstract: The growth of computer networks has produced many large, complex, heterogeneous datasets. Data integration is the usual way to make effective use of such data, and schema matching is its core technology. However, many datasets are highly heterogeneous and may contain duplicate records, missing values, or missing schema information, which makes traditional schema matching techniques inapplicable. This paper therefore studies schema matching when the schema information is unknown or incomplete, and proposes a schema matching model based on information theory. The model relies entirely on the characteristics of the data distribution and does not depend on any external knowledge. It accurately computes the similarity between attribute columns, and it describes both the data distribution of each attribute column and the relationships among the columns. The paper also presents algorithms for constructing an original data distribution graph and an evolutive data distribution graph, which formally express the relationships between attribute columns and thereby achieve matching. A comprehensive experimental evaluation on real datasets demonstrates the feasibility and effectiveness of the proposed method.

Key words: schema matching, data-oriented, information theory model
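The abstract does not give the model's formulas, so the following is only a generic illustration of how information theory can describe columns without any schema information. It assumes, as a hypothetical reading and not as the authors' algorithm, that each attribute column is summarized by the Shannon entropy of its values and each pair of columns in the same table by their mutual information. The function names `entropy`, `mutual_information`, and `distribution_graph` are invented for this sketch.

```python
import math
from collections import Counter
from itertools import combinations


def entropy(values):
    """Shannon entropy, in bits, of the empirical distribution of a column."""
    counts = Counter(values)
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def mutual_information(xs, ys):
    """I(X;Y) = H(X) + H(Y) - H(X,Y) for two row-aligned columns."""
    return entropy(xs) + entropy(ys) - entropy(list(zip(xs, ys)))


def distribution_graph(table):
    """Summarize one table (hypothetical structure, not the paper's graph).

    `table` maps a column label to its list of values; labels are only
    used as keys and never interpreted. Returns node weights (entropy per
    column) and edge weights (mutual information per column pair).
    """
    nodes = {name: entropy(col) for name, col in table.items()}
    edges = {
        (a, b): mutual_information(table[a], table[b])
        for a, b in combinations(table, 2)
    }
    return nodes, edges


if __name__ == "__main__":
    table = {
        "c1": ["a", "a", "b", "b"],
        "c2": ["x", "x", "y", "y"],  # fully determined by c1
        "c3": ["p", "q", "p", "q"],  # independent of c1
    }
    nodes, edges = distribution_graph(table)
    print(nodes)
    print(edges)
```

Under this assumption, two tables with unknown schemas could be compared by looking for a column correspondence that keeps the entropy and mutual information values as close as possible. That comparison step is not shown here, and the abstract does not say how the paper performs it.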

ZHAO Chenlu, SHEN Derong, KOU Yue, NIE Tiezheng, YU Ge. Data-Oriented Method of Schema Matching Utilizing Information Theory[J]. Journal of Frontiers of Computer Science and Technology, 2013, 7(9): 819-830.

