计算机科学与探索 ›› 2009, Vol. 3 ›› Issue (5): 519-538.DOI: 10.3778/j.issn.1673-9418.2009.05.008

• 学术研究 • 上一篇    下一篇

通过最大化实现数据流算法中的可变滑动窗口

Jan Peter Patist+   

  • 收稿日期:1900-01-01 修回日期:1900-01-01 出版日期:2009-09-15 发布日期:2009-09-15
  • 通讯作者: Jan Peter Patist

Flexible Windowing in Data Stream Algorithms by Maximization

Jan Peter Patist+   

  1. Artificial Intelligence, Vrije Universiteit Amsterdam, De Boelelaan 1081a, 1081 HV Amsterdam, The Netherlands
  • Received:1900-01-01 Revised:1900-01-01 Online:2009-09-15 Published:2009-09-15
  • Contact: Jan Peter Patist

摘要: 数据流挖掘中很多算法是基于定长滑动窗口的,定长滑动窗口的缺点是很难设置窗口的大小,而且对数据流分布的不同类型不存在最优大小的窗口,因此算法的性能较差。提出了可变滑动窗口算法,通过高效维护一个静态的最大范化均值完成。该常量在全部时间窗口中被最大化因而使用变长窗口。其他算法可以用该方法重新描述。实验表明了范化均值的有效性。

关键词: 泛化均值, 滑动窗口, 数据流

Abstract: Data streams are data sources delivering indefinitely and at high rate new data, which causes computational and storage problems in the analysis of these sources. An inherent assumption in the analysis of a data stream is that the data stream distribution changes over time, which causes old data to be non-representative to the current data distribution. In case our objective is to obtain knowledge of the current data properties, analysis based on old and non-representative data will result in incorrect or incomplete knowledge of the current state. To cope with this problem, many data stream analyses are built on a fixed sliding window of the most “freshest” data points received. In this way, old data is forgotten and the space and time complexity of the analysis algorithm bounded. Due to the fact that no prior knowledge of the eventual changes is available, there exists no optimal window size and the quality of the analysis degrades. To increase the quality of analysis, it proposes some initial steps towards efficiently equipping sliding window algorithms with flexible windowing. Flexible windowing is obtained by the efficient maintenance of a statistic called, the maximum normalized mean. This statistic is used as a building block for algorithms in change detection and frequent item set mining. This paper shows that the maximum normalized mean can be maintained efficiently in time and space. Furthermore, it restates several algorithms in change detection and frequent item set mining, such that the algorithms execution is only dependent on the maintenance of the maximum normalized mean. And most importantly it investigates the gain in performance due to the flexible windowing over a fixed size window.

Key words: normalized mean, sliding window, data stream

中图分类号: