Journal of Frontiers of Computer Science and Technology ›› 2013, Vol. 7 ›› Issue (7): 630-638. DOI: 10.3778/j.issn.1673-9418.1305012

• Academic Research •

Research on Imbalanced Data Classification Based on Ensemble and Under-Sampling

GUO Lijuan, NI Ziwei, JIANG Yi, ZOU Quan+

  1. School of Information Science and Technology, Xiamen University, Xiamen, Fujian 361005, China
  • Online: 2013-07-01 Published: 2013-07-02

Abstract: This paper studies the classification of imbalanced data and proposes two sampling-based methods: under-sampling by FarthestFirst clustering, and weighted random sampling of the samples. Both achieve good classification performance. The paper then proposes a method that combines weighted random sampling with classifier ensembles. The minority-class samples in the training set are assigned a range of weights, each weighted copy is merged with the majority-class samples, and weighted random sampling is applied to each merged set to produce N balanced datasets. A base classifier is trained on each balanced dataset, and the trained classifiers are combined by voting. Experimental results show that combining training-set partitioning with classifier ensembles yields better classification performance.

Key words: imbalanced classification, preprocessing, ensemble learning
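
The weighted-sampling ensemble described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not give the weight values, the sample size, or the base classifiers, so the weight schedule (spread around the imbalance ratio), the sample size (twice the minority count), the nearest-centroid base classifier, and all function names below are assumptions made purely for illustration. Labels are 0 for the majority class and 1 for the minority class.

```python
import random
from collections import Counter


def weighted_balanced_sample(majority, minority, minority_weight, size, rng):
    """Draw `size` labelled samples with replacement. Each minority sample is
    `minority_weight` times as likely to be drawn as a majority sample."""
    pool = [(x, 0) for x in majority] + [(x, 1) for x in minority]
    weights = [1.0] * len(majority) + [minority_weight] * len(minority)
    return rng.choices(pool, weights=weights, k=size)


def train_centroid(samples):
    """Toy base classifier (an assumption): the mean vector of each class."""
    sums, counts = {}, Counter()
    for x, y in samples:
        counts[y] += 1
        acc = sums.setdefault(y, [0.0] * len(x))
        for i, v in enumerate(x):
            acc[i] += v
    return {y: [v / counts[y] for v in acc] for y, acc in sums.items()}


def predict_centroid(model, x):
    """Return the label of the closest class centroid."""
    def dist2(c):
        return sum((a - b) ** 2 for a, b in zip(x, c))
    return min(model, key=lambda y: dist2(model[y]))


def train_ensemble(majority, minority, n_models=5, seed=0):
    """Train one base classifier per weighted resample of the training set."""
    rng = random.Random(seed)
    ratio = len(majority) / len(minority)   # imbalance ratio
    size = 2 * len(minority)                # assumed size of each resample
    models = []
    for k in range(n_models):
        # Assumed schedule: minority weights from 0.5x to 1.5x the ratio,
        # so each resample is roughly (not exactly) balanced.
        w = ratio * (0.5 + k / max(n_models - 1, 1))
        sample = weighted_balanced_sample(majority, minority, w, size, rng)
        models.append(train_centroid(sample))
    return models


def predict_ensemble(models, x):
    """Combine the base classifiers by majority vote."""
    votes = Counter(predict_centroid(m, x) for m in models)
    return votes.most_common(1)[0][0]
```

In this sketch, `train_ensemble(majority, minority)` takes two lists of equal-length numeric tuples and returns the trained models, and `predict_ensemble(models, x)` returns the voted label for a new point `x`. A minority weight equal to the imbalance ratio gives both classes the same expected share of each resample; spreading the weights around that value is one possible reading of "a range of weights". An odd `n_models` avoids tied votes in the two-class case.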