计算机科学与探索 ›› 2021, Vol. 15 ›› Issue (10): 1990-2001.DOI: 10.3778/j.issn.1673-9418.2006100

• 理论与算法 • 上一篇    下一篇

基于McDiarmid界的概念漂移数据流分类算法

梁斌,李光辉   

  1. 1. 江南大学 人工智能与计算机学院,江苏 无锡 214122
    2. 物联网技术应用教育部工程研究中心,江苏 无锡 214122
  • 出版日期:2021-10-01 发布日期:2021-09-30

Concept Drift Data Stream Classification Algorithm Based on McDiarmid Bound

LIANG Bin, LI Guanghui   

  1. 1. School of Artificial Intelligence and Computer Science, Jiangnan University, Wuxi, Jiangsu 214122, China
    2. Engineering Research Center of Internet of Things Technology Application, Ministry of Education, Wuxi, Jiangsu 214122, China
  • Online:2021-10-01 Published:2021-09-30

摘要:

数据流中的概念漂移会导致已有的分类模型性能显著下降。目前处理概念漂移的数据流分类算法大都只针对单一类型的概念漂移(如突变型、渐变型或重复型等),难以同时适应不同场景。为此,提出了一种新的适于多类型概念漂移的数据流分类算法。该算法通过双层窗口保存当前最新的分类结果,根据模糊集隶属度函数对窗口中数据分配权重并计算加权错误率,然后利用McDiarmid界分析当前窗口和过去窗口内错误率的差异[δ],根据[δ]是否具有显著性检测概念漂移。检测到漂移后,使用半参数对数似然算法检验当前概念是否为过去概念的重现,进而决定是否复用旧分类器。实验结果表明,与以往同类算法相比,所提算法在漂移检测延迟、误报率、分类准确率和运行时间等指标上均有一定优势。

关键词: 概念漂移, 隶属度, 双层模糊窗口, McDiarmid界, 重复概念

Abstract:

Concept drift in data streams can cause significant performance degradation of existing classification models. Most current data stream algorithms for concept drift only aim at a certain type of concept drift (such as abrupt, gradual, or recurring drift), which is difficult to adapt to different scenarios. Therefore, this paper proposes a new data stream algorithm suitable for different types of concept drift. The proposed algorithm saves the latest classification results through a two-layer window, assigns weights to it based on the membership function and calculates the weighted error rate. Then the McDiarmid bound is used to analyze the difference [δ] between the error rates of current window and the past window, and the concept drift is detected according to the significance of[δ]. After detecting drift, the semi-parametric log-likelihood algorithm is used to check whether the current new concept is a recurrence of the past concept, and then whether to reuse the old classifier is decided. Experimental results show that, the proposed algorithm outperforms the similar existing algorithms in terms of average detecting delay, false positive rate, classification accuracy and running time.

Key words: concept drift, membership degree, double fuzzy window, McDiarmid bound, recurring concept