Journal of Frontiers of Computer Science and Technology ›› 2010, Vol. 4 ›› Issue (9): 859-864. DOI: 10.3778/j.issn.1673-9418.2010.09.009

• Academic Research •

Adaptive Clustering Algorithm for Mining Subspace Clusters in High-Dimensional Data Stream*

REN Jiadong1,2, ZHOU Weiwei1+, HE Haitao1   

  1. College of Information Science and Engineering, Yanshan University, Qinhuangdao, Hebei 066004, China
  2. School of Computer Science and Technology, Beijing Institute of Technology, Beijing 100081, China
  • Online: 2010-09-09 Published: 2010-09-09
  • Contact: ZHOU Weiwei


Abstract: Clustering high-dimensional data streams is an active research topic in data mining. Because data streams are large in volume, fast-arriving, and high-dimensional, many clustering algorithms fail to achieve good clustering quality. This paper proposes SAStream, a new adaptive clustering algorithm for mining subspace clusters in high-dimensional data streams. It improves the cluster structure of HPStream and defines candidate clusters. The algorithm computes the distance between each newly arriving data point and the centroids of the candidate clusters only, rather than of all clusters, which reduces the number of clusters examined during clustering. The resulting clusters are stored in a pyramidal time frame, and a time fading function discounts the influence of historical data. When the data rate is high, the limiting radius (LimitingRadius) and the cluster selection factor are adjusted automatically, so the clustering granularity adapts continuously. Experimental results show that the algorithm clusters well and processes data quickly.

Key words: high-dimensional data stream, subspace clustering, data rate, adaptive

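Abstract (translated from the Chinese): Clustering high-dimensional data streams is a research hotspot in data mining. Because data streams are large in volume, rapidly changing, and high-dimensional, many clustering algorithms cannot achieve good clustering quality. This paper proposes SAStream, an adaptive subspace clustering algorithm for high-dimensional data streams. The algorithm improves the micro-cluster structure of HPStream and defines candidate clusters; it computes the distance from each newly arrived data point to the centroids of the candidate clusters only, within the corresponding subspaces, which reduces the number of micro-clusters examined during clustering. The resulting micro-clusters are stored in a pyramidal time frame, and a time fading function is used to delete expired micro-clusters. When the data rate is high, the limiting radius and the cluster selection factor are adjusted automatically according to the monitored system resource usage, thereby adjusting the clustering granularity. Experimental results show that the algorithm achieves good clustering quality and fast data processing.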

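
The assignment step described in the abstract can be illustrated with a minimal Python sketch. It is not the SAStream implementation: the abstract does not say how candidate clusters are selected or how the limiting radius is adapted, so the caller supplies both here. The names `Cluster`, `fade`, `subspace_distance`, and `assign`, the exponential fading form `2^(-decay * elapsed)`, the default `decay=0.5`, and the averaged absolute-difference distance are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class Cluster:
    centroid: List[float]      # full-dimensional centroid
    dims: List[int]            # indices of this cluster's subspace dimensions
    weight: float = 1.0        # faded count of absorbed points
    last_update: float = 0.0   # time of the most recent update


def fade(weight: float, elapsed: float, decay: float = 0.5) -> float:
    """Discount a weight by 2^(-decay * elapsed); the form and rate are assumed."""
    return weight * 2.0 ** (-decay * elapsed)


def subspace_distance(point: Sequence[float], cluster: Cluster) -> float:
    """Mean absolute difference over the cluster's own dimensions (assumed metric)."""
    total = sum(abs(point[d] - cluster.centroid[d]) for d in cluster.dims)
    return total / len(cluster.dims)


def assign(point: Sequence[float], clusters: List[Cluster],
           candidates: Sequence[int], limiting_radius: float, now: float) -> int:
    """Absorb `point` into the nearest candidate cluster, or open a new cluster.

    Only the clusters listed in `candidates` are examined, never the full set.
    Returns the index of the cluster that received the point.
    """
    best, best_dist = None, float("inf")
    for i in candidates:
        dist = subspace_distance(point, clusters[i])
        if dist < best_dist:
            best, best_dist = i, dist

    if best is None or best_dist > limiting_radius:
        # No candidate is close enough: start a new cluster on all dimensions.
        clusters.append(Cluster(list(point), list(range(len(point))), 1.0, now))
        return len(clusters) - 1

    c = clusters[best]
    w = fade(c.weight, now - c.last_update)  # discount the cluster's history
    c.centroid = [(w * m + x) / (w + 1.0) for m, x in zip(c.centroid, point)]
    c.weight = w + 1.0
    c.last_update = now
    return best
```

In this sketch `limiting_radius` is an ordinary argument, so an adaptive scheme of the kind the abstract describes would simply pass a different value on each call as the data rate changes; a smaller radius yields finer clusters and a larger one yields coarser clusters.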
