计算机科学与探索 ›› 2021, Vol. 15 ›› Issue (1): 132-140.DOI: 10.3778/j.issn.1673-9418.1909021

• 人工智能 • 上一篇    下一篇

有新类的动态数据流分类算法研究

武炜杰,张景祥   

  1. 江南大学 理学院,江苏 无锡 214122
  • 出版日期:2021-01-01 发布日期:2021-01-07

Research on Dynamic Data Stream Classification Algorithm with New Class

WU Weijie, ZHANG Jingxiang   

  1. School of Science, Jiangnan University, Wuxi, Jiangsu 214122, China
  • Online:2021-01-01 Published:2021-01-07

摘要:

针对有新类的动态数据流分类算法检测新类性能不高的问题,提出一种基于k近邻的完全随机森林算法(KCRForest)。该算法利用动态数据流中已知类样本构建完全随机森林的完全随机树,并根据叶节点平均路径长度将样本空间分成正常区域与异常区域。通过落入异常区域中样本的k近邻计算该样本离群值。若样本离群值大于设定阈值,则判断样本为新类,否则为已知类。落入异常区域的已知类样本由该样本的k近邻得到样本标签分布,否则取该区域中原训练样本标签分布,投票得到样本标签。当新类样本检测达到一定数量时,利用新类样本信息更新模型,便于检测其他新类。为了验证KCRForest算法检测新类的有效性,分别在4个UCI数据集上进行实验,并与已有算法进行比较。结果表明该算法的新类检测性能优于或与iForest+SVM算法、LOF+SVM算法相当,分类准确率明显高于SENCForest算法。

关键词: 新类检测, 完全随机森林, 动态数据流

Abstract:

Aiming at the low performance in detecting new class of classification algorithm on dynamic data stream with new class, a completely randomized forest algorithm based on k-nearest neighbor (KCRForest) is proposed. The algorithm constructs completely randomized trees of completely randomized forest by only known-class sam-ples in dynamic data stream, and divides the sample space into normal or abnormal region according to the average path length of leaf nodes. The outlier of a sample is obtained based on its k-nearest neighbor, when the sample falls into abnormal region. If the outlier is greater than the set threshold, the sample is judged to be new-class. Otherwise it is judged to be known-class. When the known-class sample falls into abnormal region, class distribution is obtai-ned based on its k-nearest neighbor. Otherwise class distribution can be obtained during training period. The label of known-class sample is identified by voting. When a certain number of new class samples are detected, the model is updated by the new-class sample information to detect other new classes. In order to verify the effectiveness of KCRForest algorithm in detecting new classes, experiments are carried out on 4 UCI datasets respectively, and comparisons are made with existing algorithms. The results show that the proposed method is equivalent to or better than iForest+SVM and LOF+SVM on new-class detection, and its classification accuracy is better than SENCForest.

Key words: new-class detection, completely randomized forest, dynamic data stream