Journal of Frontiers of Computer Science and Technology (计算机科学与探索) ›› 2020, Vol. 14 ›› Issue (6): 975-984. DOI: 10.3778/j.issn.1673-9418.1905091

• Artificial Intelligence •

Constructive Covering Algorithm-Based SMOTE Over-sampling Method

YAN Yuanting (严远亭), ZHU Yuanwei (朱原玮), WU Zengbao (吴增宝), ZHANG Yiwen (张以文), ZHANG Yanping (张燕平)

  1. School of Computer Science and Technology, Anhui University, Hefei 230601, China
  • Online: 2020-06-01 Published: 2020-06-04

Abstract:

Improving the recognition of minority-class samples is an active research topic in imbalanced data classification. The synthetic minority over-sampling technique (SMOTE) is one of the representative methods for this problem. In recent years, researchers have proposed a number of improvements to SMOTE that raise its performance. However, how to effectively select typical, informative minority samples for over-sampling remains an open question. Moreover, the potential of isolated minority samples to improve model performance has not received enough attention. To address these problems, this paper proposes CMOTE, an over-sampling technique based on the constructive covering algorithm (CCA) and SMOTE. CMOTE provides two CCA-based strategies for selecting key samples: one based on the number of samples in a cover and one based on cover density. Experiments on 12 typical imbalanced datasets are conducted to verify the performance of CMOTE. The results show that CMOTE is generally superior to the compared methods and that it enhances the generalization ability of the model by strengthening the influence of key samples on model performance.
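
For readers unfamiliar with the baseline, the following is a minimal sketch of plain SMOTE interpolation, not of CMOTE itself: the abstract does not give the CCA-based selection rules, so none are implemented here. The function name `smote_sketch`, its parameters, and the toy data are illustrative assumptions. The sketch only marks the seed-selection step, which is where a key-sample strategy such as the ones described above would plug in.

```python
import math
import random


def smote_sketch(minority, n_new, k=5, seed=0):
    """Plain SMOTE interpolation (not CMOTE): return n_new synthetic points."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_new):
        # Plain SMOTE picks the seed sample uniformly at random. A method that
        # selects key samples would restrict this choice (assumption: the
        # selection rule is not specified in the abstract, so it is omitted).
        i = rng.randrange(len(minority))
        base = minority[i]
        # Indices of the other minority samples, nearest first (Euclidean).
        others = [j for j in range(len(minority)) if j != i]
        others.sort(key=lambda j: math.dist(base, minority[j]))
        neighbour = minority[rng.choice(others[:k])]
        # The new point lies on the segment between the seed and the neighbour.
        gap = rng.random()
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


# Hypothetical toy minority class with two features.
minority = [[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [1.1, 1.3]]
new_points = smote_sketch(minority, n_new=6, k=2)
```

Because every synthetic point is a convex combination of two existing minority samples, it stays inside the region already spanned by the minority class. The outcome therefore depends heavily on which seed samples are chosen.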

Key words: imbalanced data, over-sampling technique, synthetic minority over-sampling technique (SMOTE), constructive covering algorithm (CCA)