Journal of Frontiers of Computer Science and Technology ›› 2020, Vol. 14 ›› Issue (10): 1776-1786. DOI: 10.3778/j.issn.1673-9418.1911021

• Theory and Algorithm •

Sampling Safety Coefficient for Multi-class Imbalance Oversampling Algorithm

DONG Minggang, LIU Ming, JING Chao

  1. School of Information Science and Engineering, Guilin University of Technology, Guilin, Guangxi 541004, China
  2. Guangxi Key Laboratory of Embedded Technology and Intelligent System, Guilin, Guangxi 541004, China
  • Online: 2020-10-01 Published: 2020-10-12

Abstract:

Traditional oversampling algorithms tend to cause overgeneralization and class overlap when applied to multi-class imbalance problems, which degrades classification performance. To improve multi-class imbalanced learning, an oversampling algorithm based on sampling safety coefficients, SSCMIO (sampling safety coefficient for multi-class imbalance oversampling), is proposed. First, to prevent overgeneralization, a neighbor sampling safety coefficient assigns a small weight to neighborhoods that would cause overgeneralization. Then, taking the global characteristics of the sample points into account, a reverse neighbor sampling safety coefficient prevents newly synthesized samples from intruding into the regions of other classes, which alleviates the overlap between classes. Finally, with the C4.5 decision tree as the base classifier, SSCMIO is compared experimentally with 7 representative oversampling algorithms. On 16 public real-world data sets, SSCMIO obtains more than 11 best values on each of five metrics (precision, recall, F-measure, MG, and MAUC), with maximum improvements of 0.4818, 0.3053, 0.3420, 0.2664, and 0.1307, respectively. The experimental results show that SSCMIO achieves better classification performance than the other 7 algorithms.

Key words: sampling safety coefficient, oversampling, synthetic minority technique, multi-class imbalanced problems
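
The neighbor-weighting idea described in the abstract can be illustrated with a minimal sketch. This is not the SSCMIO algorithm: the abstract does not define either coefficient, so the score below (the share of a point's k nearest neighbors that carry the same label) is an assumed stand-in, the reverse neighbor coefficient is omitted, and the names `k_nearest`, `neighbor_safety`, and `oversample` are hypothetical. The sketch only shows how a safety score could steer SMOTE-style interpolation away from mixed neighborhoods.

```python
import math
import random


def k_nearest(points, i, k):
    """Indices of the k points closest to points[i], excluding i itself."""
    others = [j for j in range(len(points)) if j != i]
    others.sort(key=lambda j: math.dist(points[i], points[j]))
    return others[:k]


def neighbor_safety(points, labels, i, k):
    """Assumed stand-in score, NOT the paper's coefficient:
    share of the k nearest neighbors that have the same label as point i."""
    nn = k_nearest(points, i, k)
    return sum(labels[j] == labels[i] for j in nn) / len(nn)


def oversample(points, labels, minority, n_new, k=3, seed=0):
    """Create n_new synthetic minority samples by interpolation.
    Seed points are drawn with probability proportional to their safety
    score, so points in mixed neighborhoods are picked less often."""
    rng = random.Random(seed)
    idx = [i for i, lab in enumerate(labels) if lab == minority]
    # A small constant keeps the weights valid if every score is zero.
    weights = [neighbor_safety(points, labels, i, k) + 1e-9 for i in idx]
    synthetic = []
    for i in rng.choices(idx, weights=weights, k=n_new):
        same = [j for j in k_nearest(points, i, k) if labels[j] == minority]
        if not same:
            # No same-class neighbor: fall back to copying the seed point.
            synthetic.append(tuple(points[i]))
            continue
        j = rng.choice(same)
        t = rng.random()  # interpolation factor in [0, 1)
        synthetic.append(
            tuple(a + t * (b - a) for a, b in zip(points[i], points[j]))
        )
    return synthetic


points = [(0, 0), (0, 1), (1, 0), (1, 1), (0.5, 0.5), (5, 5), (5, 6), (6, 5)]
labels = [0, 0, 0, 0, 0, 1, 1, 1]
new_samples = oversample(points, labels, minority=1, n_new=4, k=2)
```

Because every synthetic point is interpolated between two minority points, the new samples in this toy example stay inside the region spanned by class 1.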