Journal of Frontiers of Computer Science and Technology, 2021, Vol. 15, Issue (10): 1900-1911. DOI: 10.3778/j.issn.1673-9418.2006079

• Database Technology •

Research on Increased Data Repair with Confidence Value Token

HUANG Hui, LI Hailin

  1. College of Computer Science and Engineering, Sanjiang University, Nanjing 210012, China
  2. College of Electronic and Information Engineering, Nanjing University of Aeronautics and Astronautics, Nanjing 211100, China
  • Online: 2021-10-01 Published: 2021-09-30

Abstract:

In the era of big data, data holds great value and has become an important strategic resource in today's information society. However, large amounts of inconsistent data arise while data is processed and updated, with unpredictable adverse effects on enterprise decision-making. Existing work studies data repair mainly on the basis of functional dependencies, and existing repair methods fall into three categories. The first two require the enterprise to provide a master database or confidence values for given tuples, conditions that are often not met in real applications. The third, based on the minimal deletion principle, causes loss of information. Moreover, when a functional dependency X→Y is violated, existing methods only support modifying the value of the Y attribute. To address these shortcomings, for the case where tuple confidence is not given, this paper proposes an incremental data repair method with confidence value tokens. The method has two parts. The first part automatically generates a confidence value token for each cell by analyzing operation logs and knowledge rules. The second part is an incremental repair strategy that uses the token values to decide whether to repair the X or the Y attribute value, and uses conditional probability to select the target value for the repair. Experimental results show that the proposed method has high reliability and scalability.

Key words: confidence value token (CVT), incremental data repair, functional dependencies, operation log
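The repair decision outlined in the abstract can be illustrated with a minimal Python sketch. It is not the authors' algorithm: the function names, the `zip`/`city` attributes, the sample rows, and the hard-coded confidence values are all hypothetical, and the sketch omits both the generation of tokens from operation logs and the incremental handling of updates. It only shows one plausible reading of the second part: for a violated dependency X→Y, repair whichever cell has the lower confidence, and pick the replacement value with the highest empirical conditional probability. When the two confidences are equal, this sketch repairs Y.

```python
from collections import Counter, defaultdict


def find_violations(rows, x, y):
    """Return {x_value: [row indices]} for X groups holding more than one Y value."""
    groups = defaultdict(list)
    for i, row in enumerate(rows):
        groups[row[x]].append(i)
    return {xv: idx for xv, idx in groups.items()
            if len({rows[i][y] for i in idx}) > 1}


def most_probable(rows, given_attr, given_value, target_attr):
    """Return the target_attr value with the highest empirical P(target | given)."""
    counts = Counter(r[target_attr] for r in rows if r[given_attr] == given_value)
    return counts.most_common(1)[0][0]


def repair(rows, conf, x, y):
    """Repair violations of x -> y in place.

    conf[i][attr] is an assumed per-cell confidence in [0, 1]; in the paper
    such values would come from operation logs and knowledge rules.
    """
    for xv, idx in find_violations(rows, x, y).items():
        target_y = most_probable(rows, x, xv, y)
        for i in idx:
            if rows[i][y] == target_y:
                continue  # this row already agrees with the majority value
            if conf[i][x] < conf[i][y]:
                # The X cell is less trusted: replace it with the most likely X for this Y.
                rows[i][x] = most_probable(rows, y, rows[i][y], x)
            else:
                # The Y cell is less (or equally) trusted: replace it with the majority Y.
                rows[i][y] = target_y


# Hypothetical data for the dependency zip -> city.
rows = [
    {"zip": "210012", "city": "Nanjing"},
    {"zip": "210012", "city": "Nanjing"},
    {"zip": "210012", "city": "Nanking"},  # suspicious city value
    {"zip": "210012", "city": "Beijing"},  # suspicious zip value
    {"zip": "100000", "city": "Beijing"},
    {"zip": "100000", "city": "Beijing"},
]
conf = [{"zip": 0.9, "city": 0.9} for _ in rows]
conf[2] = {"zip": 0.9, "city": 0.2}
conf[3] = {"zip": 0.1, "city": 0.9}

repair(rows, conf, "zip", "city")
```

In this example the third row has a low-confidence `city` cell, so its Y value is rewritten, while the fourth row has a low-confidence `zip` cell, so its X value is rewritten instead. A method that only modifies Y would have overwritten `Beijing` in the fourth row, which is the limitation the abstract points out.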