Journal of Frontiers of Computer Science and Technology (计算机科学与探索) ›› 2015, Vol. 9 ›› Issue (7): 869-876. DOI: 10.3778/j.issn.1673-9418.1409037

• Artificial Intelligence and Pattern Recognition •

Algorithm for Imbalanced Dataset Based on K-Nearest Neighbor in Kernel Space

DU Hongle

  1. School of Mathematics and Computer Application, Shangluo University, Shangluo, Shaanxi 726000, China
  • Online: 2015-07-01 Published: 2015-07-07

Abstract: To address the overfitting of conventional classifiers and improve classification performance, this paper proposes a hybrid-sampling classification algorithm for imbalanced datasets based on K-nearest neighbors in kernel space. The algorithm first computes, in kernel space, the k nearest neighbors of each sample among the samples of the opposite class, together with the average distance between the two classes, taken as the distance between the two class centers. It then deletes the samples that lie far from the classification boundary, as determined by control parameters, and applies the SMOTE over-sampling algorithm to the minority class to generate a new, balanced training set. Finally, the decision function is obtained by training on the new training set. The approach is intended to mitigate class imbalance and improve the classification performance of SVM. Experiments on an artificial dataset and four UCI datasets indicate that the algorithm is effective for imbalanced datasets.

Key words: support vector machine, imbalanced dataset, over-sampling, under-sampling, K-nearest neighbor
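
The resampling steps outlined in the abstract can be illustrated with a minimal Python sketch. It is not the paper's implementation: the RBF kernel and its `gamma`, the deletion rule (drop a majority sample when its mean kernel-space distance to its nearest minority neighbors exceeds `lam` times the class-center distance), and all function names are assumptions made for illustration. The abstract states only that samples far from the boundary are deleted according to control parameters. The sketch assumes at least two minority samples.

```python
import math
import random

def rbf(x, y, gamma=0.5):
    """RBF kernel; the kernel choice and gamma are assumptions."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))

def kdist(x, y, k=rbf):
    """Kernel-space distance: sqrt(K(x,x) - 2K(x,y) + K(y,y))."""
    return math.sqrt(max(k(x, x) - 2.0 * k(x, y) + k(y, y), 0.0))

def center_dist(A, B, k=rbf):
    """Distance between the two class centers in kernel space."""
    def mean_k(P, Q):
        return sum(k(p, q) for p in P for q in Q) / (len(P) * len(Q))
    sq = mean_k(A, A) - 2.0 * mean_k(A, B) + mean_k(B, B)
    return math.sqrt(max(sq, 0.0))

def knn_mean_dist(x, others, n_neighbors):
    """Mean kernel distance from x to its nearest points in `others`."""
    d = sorted(kdist(x, o) for o in others)[:n_neighbors]
    return sum(d) / len(d)

def undersample(majority, minority, n_neighbors=3, lam=1.0):
    """Keep majority samples close to the opposite class (hypothetical rule)."""
    limit = lam * center_dist(majority, minority)
    return [x for x in majority
            if knn_mean_dist(x, minority, n_neighbors) <= limit]

def smote(minority, n_new, n_neighbors=3, rng=None):
    """SMOTE-style interpolation between a sample and a near minority neighbor."""
    if rng is None:
        rng = random.Random(0)
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        neighbors = sorted((p for p in minority if p is not x),
                           key=lambda p: kdist(x, p))[:n_neighbors]
        nb = rng.choice(neighbors)
        u = rng.random()  # interpolation weight in [0, 1)
        synthetic.append(tuple(a + u * (b - a) for a, b in zip(x, nb)))
    return synthetic

def hybrid_resample(majority, minority, n_neighbors=3, lam=1.0, rng=None):
    """Under-sample the majority class, then over-sample the minority class."""
    kept = undersample(majority, minority, n_neighbors, lam)
    n_new = max(len(kept) - len(minority), 0)
    return kept, minority + smote(minority, n_new, n_neighbors, rng)
```

With an RBF kernel every kernel-space distance is bounded by the square root of 2, so the useful range of `lam` in this sketch is narrow and depends on the data. The resampled sets would then be passed to an SVM trainer, a step omitted here.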