计算机科学与探索 ›› 2018, Vol. 12 ›› Issue (12): 1950-1960.DOI: 10.3778/j.issn.1673-9418.1807036

• 人工智能与模式识别 • 上一篇    下一篇

结合局部敏感哈希和随机游走的异常检测算法

舒敏,刘华文,郑忠龙,徐晓丹   

  1. 浙江师范大学 数理与信息工程学院,浙江 金华 321004
  • 出版日期:2018-12-01 发布日期:2018-12-07

Outlier Detection Algorithm Combining Locality Sensitive Hashing and Random Walks

SHU Min, LIU Huawen, ZHENG Zhonglong, XU Xiaodan   

  1. College of Mathematics, Physics and Information Engineering, Zhejiang Normal University, Jinhua, Zhejiang 321004, China
  • Online:2018-12-01 Published:2018-12-07

摘要:

异常检测是数据挖掘的主要研究热点问题之一。目前已存在很多异常检测的方法,但是现存的主要异常检测方法在高维数据处理过程中效率较低。为解决此问题,提出了一种高效的异常检测算法。该算法结合局部敏感哈希的性质和图的随机游走来识别异常点。具体而言,通过局部敏感哈希实现对高维数据的高效处理,随后利用数据之间距离获取其相似性,并将其转化为随机游走的转移概率。在此基础上,使用随机游走技术计算数据之间的游走概率,其中正常数据之间的转移概率越来越高,而异常点的概率越来越低,进而根据此性质最终辨别异常数据。实验结果表明,提出的方法能有效检测出数据中的异常,总体上优于其他异常检测算法。

关键词: 异常点检测, 局部敏感哈希, 随机游走, 数据挖掘

Abstract:

Outlier detection is one of the main research hotspots of data mining. There are many outlier detection methods, but the existing main outlier detection methods are inefficient in high dimensional data processing. In order to solve this problem, this paper proposes an efficient outlier detection algorithm. The algorithm combines the properties of locality sensitive Hashing and the random walks of the graph to identify outliers. Specifically, locality sensitive Hashing is used to efficiently process high-dimensional data, then the similarity is obtained by using the distance between data, and it is transformed into a transition probability of random walks. On this basis, the random walks technology is used to calculate the transfer probability between the data, in which the transfer probability between normal data will be higher and higher, and the probability of anomaly will be lower and lower, and then the anomaly data can be finally distinguished according to this property. The experimental results show that the pro-posed method can effectively detect outliers in the data and is generally better than other outlier detection algorithms.

Key words: outlier detection, locality sensitive Hashing, random walks, data mining