Journal of Frontiers of Computer Science and Technology ›› 2015, Vol. 9 ›› Issue (2): 172-181. DOI: 10.3778/j.issn.1673-9418.1405050

• Database Technology •

Top-k Outlier Detection Algorithm on Uncertain Data Stream

CAO Keyan+, WANG Guoren, HAN Donghong, LI Shuoru

  1. College of Information Science and Engineering, Northeastern University, Shenyang 110819, China
  • Online: 2015-02-01 Published: 2015-02-03

Abstract: In recent years, with the emergence of data streams and uncertain data, outlier detection on uncertain data streams has become a new research hotspot. However, existing definitions of outliers on uncertain data involve three parameters that are very difficult for users to set, so users often cannot retrieve suitable outliers. In most cases, users would rather know which objects are most likely to be outliers. This paper therefore proposes a top-k outlier detection algorithm on uncertain data streams. The algorithm prunes objects by estimating the range of each object's probability of being an outlier, which avoids some unnecessary computation, and it computes these probability ranges incrementally to improve efficiency. A series of simulation experiments on real and synthetic datasets verifies the performance of the algorithm.

Key words: uncertain data, data mining, outlier, top-k
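
The pruning idea described in the abstract can be illustrated with a minimal Python sketch. It is not the paper's algorithm: it assumes that a lower and an upper bound on each object's outlier probability are already available, and it shows only the generic bound-based top-k pruning step. The names `prune_candidates`, `top_k_outliers`, and `exact_probability` are hypothetical, and the incremental maintenance of the bounds over a stream is not covered.

```python
import heapq
from typing import Callable, Dict, List, Tuple

# Assumption: (lower, upper) bounds on P(object is an outlier), supplied by the caller.
Bounds = Tuple[float, float]


def prune_candidates(bounds: Dict[str, Bounds], k: int) -> List[str]:
    """Return the objects that could still be among the k most probable outliers."""
    if k <= 0:
        return []
    if len(bounds) <= k:
        return list(bounds)
    # k-th largest lower bound: at least k objects are guaranteed to reach it.
    lowers = [lower for lower, _ in bounds.values()]
    threshold = heapq.nlargest(k, lowers)[-1]
    # An object whose upper bound is below the threshold is beaten by
    # at least k other objects, so it can be discarded without exact computation.
    return [obj for obj, (_, upper) in bounds.items() if upper >= threshold]


def top_k_outliers(
    bounds: Dict[str, Bounds],
    exact_probability: Callable[[str], float],  # hypothetical, expensive to call
    k: int,
) -> List[str]:
    """Compute exact probabilities only for the surviving candidates."""
    candidates = prune_candidates(bounds, k)
    return heapq.nlargest(k, candidates, key=exact_probability)


if __name__ == "__main__":
    demo_bounds = {
        "a": (0.7, 0.9),
        "b": (0.1, 0.3),
        "c": (0.5, 0.8),
        "d": (0.2, 0.55),
        "e": (0.0, 0.1),
    }
    demo_exact = {"a": 0.8, "b": 0.2, "c": 0.6, "d": 0.5, "e": 0.05}
    print(prune_candidates(demo_bounds, 2))
    print(top_k_outliers(demo_bounds, demo_exact.__getitem__, 2))
```

In this sketch the expensive probability computation runs only for objects whose upper bound reaches the k-th largest lower bound, which is one common way to skip unnecessary work in a top-k query.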