Journal of Frontiers of Computer Science and Technology, 2015, Vol. 9, Issue (9): 1044-1055. DOI: 10.3778/j.issn.1673-9418.1411043

• Database Technology •

Hadoop-Based Inconsistence Detection and Reparation Algorithm for Big Data

ZHANG Anzhen, MEN Xueying, WANG Hongzhi+, LI Jianzhong, GAO Hong   

  1. School of Computer Science and Technology, Harbin Institute of Technology, Harbin 150001, China
  • Online:2015-09-01 Published:2015-12-11

Abstract: With the widespread use of Internet applications, massive volumes of data are being generated, and quality problems in this data are common. This paper studies the inconsistency problem in data quality, and designs and implements an inconsistent data detection and repair algorithm on the Hadoop parallel platform. Using conditional functional dependencies (CFDs) from data dependency theory, the algorithm detects the inconsistent data according to the given rules and computes a repair for it, so that the repaired dataset satisfies the consistency requirements. It also reports the deterministic probability of the repair result. Experiments show that the algorithm achieves better repair quality than existing single-machine algorithms, and that its running time grows linearly when the number of constraint rules is small.

Key words: data inconsistency, MapReduce, conditional functional dependency, data quality
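
To make the detection step described in the abstract more concrete, the following is a minimal sketch of CFD violation detection written in a map/shuffle/reduce style. It is not the authors' algorithm: the `CFD` class, the `map_phase`, `reduce_phase` and `detect` functions, the `_` wildcard convention, and the sample records are all assumptions made for illustration, and the whole thing runs in memory in plain Python rather than on Hadoop. It covers detection only; repair and the deterministic probability are not modeled.

```python
from collections import defaultdict
from dataclasses import dataclass

WILDCARD = "_"  # hypothetical convention: a pattern entry that matches any value


@dataclass(frozen=True)
class CFD:
    """Illustrative CFD: (lhs -> rhs) restricted by a pattern of constants/wildcards."""
    lhs: tuple          # attribute names on the left-hand side
    rhs: str            # single attribute name on the right-hand side
    lhs_pattern: tuple  # constant or WILDCARD for each LHS attribute
    rhs_pattern: str    # constant or WILDCARD for the RHS attribute


def map_phase(records, cfd):
    """Emit (LHS values, (row id, RHS value)) for rows matching the LHS pattern."""
    for rid, row in enumerate(records):
        key = tuple(row[a] for a in cfd.lhs)
        if all(p == WILDCARD or p == v for p, v in zip(cfd.lhs_pattern, key)):
            yield key, (rid, row[cfd.rhs])


def reduce_phase(values, cfd):
    """Return the row ids in one group that violate the CFD."""
    if cfd.rhs_pattern != WILDCARD:
        # Constant RHS: each row must carry exactly that constant.
        return [rid for rid, v in values if v != cfd.rhs_pattern]
    if len({v for _, v in values}) > 1:
        # Variable RHS: rows that agree on the LHS must agree on the RHS.
        return [rid for rid, _ in values]
    return []


def detect(records, cfd):
    groups = defaultdict(list)  # stands in for the shuffle step
    for key, value in map_phase(records, cfd):
        groups[key].append(value)
    violating = set()
    for values in groups.values():
        violating.update(reduce_phase(values, cfd))
    return sorted(violating)


# Made-up sample data, not taken from the paper.
records = [
    {"cc": "44", "ac": "131", "zip": "EH4", "street": "Mayfield", "city": "EDI"},
    {"cc": "44", "ac": "131", "zip": "EH4", "street": "Crichton", "city": "EDI"},
    {"cc": "01", "ac": "908", "zip": "07974", "street": "Mtn Ave", "city": "MH"},
    {"cc": "01", "ac": "908", "zip": "07974", "street": "Main St", "city": "NYC"},
]
# When cc is "44", zip determines street.
cfd_street = CFD(("cc", "zip"), "street", ("44", WILDCARD), WILDCARD)
# When cc is "01" and ac is "908", city must be "MH".
cfd_city = CFD(("cc", "ac"), "city", ("01", "908"), "MH")

print(detect(records, cfd_street))  # [0, 1]
print(detect(records, cfd_city))    # [3]
```

Grouping by the left-hand-side values means each group can be checked independently, which is what makes this kind of check a natural fit for a MapReduce job. In this sketch, each rule would correspond to one such pass over the data.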