Journal of Frontiers of Computer Science and Technology ›› 2020, Vol. 14 ›› Issue (2): 274-284. DOI: 10.3778/j.issn.1673-9418.1901017

• Artificial Intelligence •

Ensemble Classification Algorithm for Imbalanced Data Combined with Local Area Density

YANG Hao, CHEN Hongmei

  1. School of Information Science and Technology, Southwest Jiaotong University, Chengdu 611756, China
  2. Key Laboratory of Cloud Computing and Intelligent Technology, Southwest Jiaotong University, Chengdu 611756, China
  • Online: 2020-02-01 Published: 2020-02-16

Abstract:

Oversampling is one of the effective ways to handle imbalanced classification problems. However, oversampling methods based on SMOTE (synthetic minority oversampling technique) become less effective when the dataset exhibits class overlapping and small disjuncts. To address this problem, an oversampling algorithm based on the local density of samples, MOLAD, is proposed. Building on it, the algorithm LADBMOTE is proposed for classifying imbalanced data; it combines MOLAD with Bagging-based ensemble learning at the sampling stage. LADBMOTE first computes the K nearest neighbors of each minority-class sample according to MOLAD, then uses all K nearest neighbors for sampling, which yields K balanced datasets. Finally, Bagging-based ensemble learning combines the classifiers trained on the K balanced datasets. LADBMOTE is compared with 7 currently popular algorithms for handling imbalanced data on 20 public imbalanced datasets from KEEL. The experimental results show that LADBMOTE achieves better classification performance and stronger robustness across different classifiers.

Key words: imbalanced data, strategy for calculating nearest neighbors, ensemble learning, oversampling
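
The pipeline summarized in the abstract (K nearest neighbors per minority sample, K balanced datasets, one classifier per dataset, combined by voting) can be illustrated with a minimal sketch. Several parts are assumptions, because the abstract does not give the details: plain Euclidean nearest neighbors stand in for MOLAD's density-based neighbor computation, the k-th balanced dataset is built by SMOTE-style interpolation toward each seed's k-th neighbor, a nearest-centroid rule stands in for the base classifier, and a simple majority vote stands in for the Bagging-based ensemble. All function names are hypothetical.

```python
import math
import random
from collections import Counter

def k_nearest(point, others, k):
    """Return the k points in `others` closest to `point` (Euclidean distance).
    Assumption: MOLAD would replace this with its local-density-based strategy."""
    return sorted(others, key=lambda q: math.dist(point, q))[:k]

def balanced_dataset(minority, majority, rank, k, rng):
    """Oversample the minority class until it matches the majority class in size.
    Each synthetic point lies on the segment between a random minority seed and
    that seed's neighbor of the given rank (0 = closest), as in SMOTE.
    Assumes distinct minority points and k <= len(minority) - 1."""
    neighbors = {p: k_nearest(p, [q for q in minority if q != p], k)
                 for p in minority}
    synthetic = []
    while len(minority) + len(synthetic) < len(majority):
        seed = rng.choice(minority)
        nb = neighbors[seed][rank]
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append(tuple(s + gap * (n - s) for s, n in zip(seed, nb)))
    return minority + synthetic, majority

def centroid(points):
    """Coordinate-wise mean of a list of equal-length tuples."""
    return tuple(sum(c) / len(points) for c in zip(*points))

def train_centroid_classifier(pos, neg):
    """Hypothetical stand-in base learner: label 1 (minority) if x is at least
    as close to the minority centroid as to the majority centroid, else 0."""
    c_pos, c_neg = centroid(pos), centroid(neg)
    return lambda x: 1 if math.dist(x, c_pos) <= math.dist(x, c_neg) else 0

def fit_ensemble(minority, majority, k=3, seed=0):
    """Build k balanced datasets (one per neighbor rank) and train one model on each."""
    rng = random.Random(seed)
    models = []
    for rank in range(k):
        pos, neg = balanced_dataset(minority, majority, rank, k, rng)
        models.append(train_centroid_classifier(pos, neg))
    return models

def predict(models, x):
    """Combine the individual predictions by majority vote."""
    votes = Counter(model(x) for model in models)
    return votes.most_common(1)[0][0]

if __name__ == "__main__":
    # Toy data: 4 minority points (label 1) and 10 majority points (label 0).
    minority = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (0.15, 0.15)]
    majority = [(3.0 + 0.1 * i, 3.0 - 0.1 * i) for i in range(10)]
    models = fit_ensemble(minority, majority, k=3, seed=0)
    print(predict(models, (0.1, 0.1)), predict(models, (3.2, 2.9)))
```

Points are stored as tuples so they can serve as dictionary keys. An odd K avoids tied votes in the two-class case.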