Journal of Frontiers of Computer Science and Technology, 2020, Vol. 14, Issue (3): 401-409. DOI: 10.3778/j.issn.1673-9418.1905022

• Academic Research •

Imbalance Classification Based on Informative Instances Selection

XU Jian, WANG Xinyue, CAI Zixin, SHEN Qihang, JING Liping   

  1. School of Computer and Information Technology, Beijing Jiaotong University, Beijing 100044, China
  • Online:2020-03-01 Published:2020-03-13

Abstract:

Class imbalance is a common challenge when traditional models are applied to practical classification problems. Because traditional learning algorithms cannot sufficiently learn the hidden patterns of the minority classes, they tend to be biased towards the majority classes, so minority instances are often misclassified as majority instances. Moreover, redundant and noisy data in the dataset can also cause problems for the classifier. To deal with these problems, this paper proposes a new imbalance classification framework, SSIC. The framework fully considers the statistical properties of the dataset, adaptively selects valuable instances from both the majority and minority classes, and combines them with cost-sensitive learning to construct an imbalance classifier. First, SSIC constructs several balanced data subsets by combining part of the majority-class instances with all minority-class instances. On each subset, SSIC exploits the characteristics of the data to extract discriminative high-level features and adaptively selects the important samples, so that redundant and noisy data can be removed. Second, SSIC introduces a cost-sensitive support vector machine (SVM) that automatically assigns an appropriate weight to each instance, so that the minority class is treated as equally important as the majority class.

Key words: class imbalance learning, classification, squeeze-and-excitation networks, cost-sensitive learning
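
The two data-handling steps named in the abstract, building balanced subsets and weighting instances, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `balanced_subsets` and `inverse_frequency_weights` are hypothetical, uniform random sampling of the majority class is an assumption, and the inverse-class-frequency weighting is only one common cost-sensitive scheme, since the abstract does not state how SSIC computes its weights. The adaptive instance selection and feature extraction steps are not covered.

```python
import random
from collections import Counter


def balanced_subsets(labels, minority_label, n_subsets, seed=0):
    """Return n_subsets index lists (hypothetical helper).

    Each list holds every minority-class index plus an equally sized
    random draw of majority-class indices (assumed: uniform sampling).
    """
    rng = random.Random(seed)
    minority = [i for i, y in enumerate(labels) if y == minority_label]
    majority = [i for i, y in enumerate(labels) if y != minority_label]
    k = min(len(minority), len(majority))
    return [minority + rng.sample(majority, k) for _ in range(n_subsets)]


def inverse_frequency_weights(labels):
    """Weight each instance by n / (n_classes * class_count).

    Assumed scheme, not taken from the paper: it makes every class
    contribute the same total weight to a weighted loss.
    """
    counts = Counter(labels)
    n, n_classes = len(labels), len(counts)
    return [n / (n_classes * counts[y]) for y in labels]


# Toy data: 90 majority instances (label 0) and 10 minority instances (label 1).
labels = [0] * 90 + [1] * 10
subsets = balanced_subsets(labels, minority_label=1, n_subsets=3)
weights = inverse_frequency_weights(labels)
```

Weights of this kind could then be handed to any learner that accepts per-instance weights, for example a weighted SVM, so that errors on minority instances cost more than errors on majority instances.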