Journal of Frontiers of Computer Science and Technology, 2015, Vol. 9, Issue (10): 1180-1194. DOI: 10.3778/j.issn.1673-9418.1505080

Design of Evaluating Sequential Data Quality with Gap Constraint

WANG Huifeng1, DUAN Lei1,2+, HU Bin3, DENG Song4, WANG Wentao1, QIN Pan1   

  1. School of Computer Science, Sichuan University, Chengdu 610065, China
  2. West China School of Public Health, Sichuan University, Chengdu 610041, China
  3. Smart Grid Research Institute, State Grid, Nanjing 210003, China
  4. Institute of Advanced Technology, Nanjing University of Posts and Telecommunications, Nanjing 210003, China
  • Online: 2015-10-01 Published: 2015-09-29

带间隔约束的序列数据质量评价算法设计

王慧锋1,段磊1,2+,胡斌3,邓松4,王文韬1,秦攀1   

  1. 四川大学 计算机学院,成都 610065
  2. 四川大学 华西公共卫生学院,成都 610041
  3. 国家电网 智能电网研究院,南京 210003
  4. 南京邮电大学 先进技术研究院,南京 210003

Abstract: Sequential data is widespread in real-world applications, and mining it is an important research topic in data mining. The reliability of mining results depends on the quality of the sequences. Traditional data quality evaluation methods analyze data quality through statistical indicators, but such indicators cannot evaluate the relationships among the elements of an unstructured sequence. To assess the quality of a sequence, this paper proposes a quality evaluation algorithm for sequential data based on the probabilistic suffix tree. Specifically, under a specified gap constraint, a probabilistic suffix tree is built from sequences of reliable quality. The tree is then used to evaluate the quality of a query sequence. Finally, experiments on real-world sequence sets confirm the effectiveness, efficiency, and scalability of the proposed algorithm.

Key words: data quality, probabilistic suffix tree, gap constraint

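摘要 (Abstract): Sequential data is widespread in practical applications, so algorithms for mining sequential data have long been an active research topic. The quality of sequential data determines the reliability of mining results. Traditional data quality evaluation methods mostly analyze data quality problems through statistical indicators, but statistical indicators cannot assess the relationships among the elements of unstructured sequential data. To assess the quality of sequential data, a quality evaluation algorithm based on the probabilistic suffix tree model is proposed. Specifically, subject to the gap constraint, a probabilistic suffix tree is generated from sequence samples of reliable quality, and the tree is then used to evaluate the quality of query sequences. Finally, experiments on real sequence data sets verify the effectiveness, execution efficiency, and scalability of the algorithm.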

关键词 (Key words): data quality, probabilistic suffix tree, gap constraint
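
The abstract describes two steps: learn a probabilistic model from sequences of reliable quality under a gap constraint, then score a query sequence against that model. The Python sketch below illustrates that general idea only; it is not the paper's algorithm. Several things are assumptions made for illustration: the function names, the reading of a gap as the number of skipped elements between two chosen positions, the use of a flat dictionary of contexts in place of an actual pruned suffix tree, the add-alpha smoothing, and the choice of the mean log-probability as the quality score.

```python
import math
from collections import Counter, defaultdict


def gapped_contexts(seq, i, max_depth, min_gap, max_gap):
    """Yield every context (tuple of symbols, oldest first) that precedes seq[i].

    Assumption: a gap of g means g elements are skipped between two chosen
    positions, and every gap must lie in [min_gap, max_gap].
    """
    def extend(pos, ctx):
        if ctx:
            yield ctx
        if len(ctx) == max_depth:
            return
        for gap in range(min_gap, max_gap + 1):
            prev = pos - gap - 1
            if prev < 0:
                break  # larger gaps would reach even further left
            yield from extend(prev, (seq[prev],) + ctx)

    yield from extend(i, ())


def build_model(reliable_sequences, max_depth=2, min_gap=0, max_gap=1):
    """Count which symbol follows each gapped context in the reliable data.

    The dictionary keys play the role of suffix-tree nodes; no pruning is
    done, which a real probabilistic suffix tree would normally apply.
    """
    counts = defaultdict(Counter)
    for seq in reliable_sequences:
        for i, sym in enumerate(seq):
            counts[()][sym] += 1  # empty context: plain symbol frequency
            for ctx in gapped_contexts(seq, i, max_depth, min_gap, max_gap):
                counts[ctx][sym] += 1
    return counts


def quality_score(model, seq, max_depth=2, min_gap=0, max_gap=1, alpha=1.0):
    """Return the mean log-probability of seq under the model (higher is better).

    Each symbol is predicted from one of the longest gapped contexts seen in
    training, falling back to the empty context. Add-alpha smoothing reserves
    one extra slot for symbols never seen in training.
    """
    if not seq:
        return 0.0
    alphabet_size = len(model[()])
    total = 0.0
    for i, sym in enumerate(seq):
        seen = [c for c in gapped_contexts(seq, i, max_depth, min_gap, max_gap)
                if c in model]
        ctx = max(seen, key=len, default=())
        nexts = model[ctx]
        p = (nexts[sym] + alpha) / (sum(nexts.values()) + alpha * (alphabet_size + 1))
        total += math.log(p)
    return total / len(seq)


if __name__ == "__main__":
    model = build_model(["abcabcabcabc", "bcabcabcabca"])
    print("regular query:", quality_score(model, "abcabcabc"))
    print("irregular query:", quality_score(model, "aaaaaaaaa"))
```

In this sketch a context records only which symbols were chosen, not the exact gaps between them, so occurrences with different admissible gaps are pooled into the same counts. A variant that keeps the gaps in the key would be more discriminative but would need more training data per context.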