Journal of Frontiers of Computer Science and Technology ›› 2023, Vol. 17 ›› Issue (3): 687-700. DOI: 10.3778/j.issn.1673-9418.2106049

• Artificial Intelligence · Pattern Recognition •

Multi-source Online Transfer Learning Algorithm for Imbalanced Data

ZHOU Jingyu, WANG Shitong   

  1. School of Artificial Intelligence and Computer Science, Jiangnan University, Wuxi, Jiangsu 214122, China
  • Online:2023-03-01 Published:2023-03-01

Abstract: Multi-source online transfer learning uses the labeled data of multiple source domains to enhance the classification performance of the target domain. To handle imbalanced datasets, this paper proposes a multi-source online transfer learning algorithm that oversamples in the feature spaces of both the source domains and the target domain. The algorithm consists of two parts: oversampling the multiple source domains and oversampling the online target domain. In the source-domain oversampling stage, minority class samples are generated in the feature space of the support vector machine (SVM) classifier; the new samples are obtained by augmenting the original Gram matrix with neighborhood information from the source-domain feature space. In the target-domain oversampling stage, the target-domain samples arrive in batches; the minority class samples of the current batch search for their k-nearest neighbors in the feature space among the samples of previously arrived batches, and the target-domain decision function is trained on the generated samples together with the original samples of the current batch. Through the kernel function, the source-domain and target-domain samples are mapped into the same feature space for oversampling, and the corresponding decision functions are trained on source-domain and target-domain data with relatively balanced class distributions, which improves the overall performance of the algorithm. Comprehensive experiments are carried out on four real datasets. On the tasks of the widely used Office-Home dataset, the accuracy is improved by 0.0311 and the G-mean is improved by 0.0702 compared with other baseline algorithms.
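The abstract only summarizes the method, so the following is a minimal NumPy sketch of the kind of feature-space (Gram matrix) oversampling it describes: each synthetic minority sample is a convex combination of a minority sample and one of its feature-space k-nearest neighbors, so it can be represented by a coefficient vector over the original samples, and the Gram matrix can be extended without ever forming explicit feature vectors. The function names (feature_space_knn, oversample_gram) and the interpolation scheme are illustrative assumptions, not the authors' exact formulation.

```python
import numpy as np

def feature_space_knn(K, idx, k):
    """k-nearest neighbors among the samples in `idx`, using distances induced
    by the Gram matrix K:  d^2(a, b) = K[a,a] + K[b,b] - 2 K[a,b]."""
    diag = np.diag(K)
    d2 = diag[idx][:, None] + diag[idx][None, :] - 2.0 * K[np.ix_(idx, idx)]
    np.fill_diagonal(d2, np.inf)            # exclude the point itself
    return np.argsort(d2, axis=1)[:, :k]    # neighbor positions within `idx`

def oversample_gram(K, y, minority_label, k=5, n_new=None, rng=None):
    """Augment a Gram matrix with synthetic minority samples generated by
    interpolating between feature-space neighbors (a kernel-space SMOTE-style
    scheme).  Each synthetic point is phi(z) = (1-d)*phi(x_a) + d*phi(x_b),
    so it is fully described by a sparse coefficient vector over the
    original samples."""
    rng = np.random.default_rng(rng)
    minority = np.flatnonzero(y == minority_label)
    if n_new is None:
        # by default, generate enough samples to balance the two classes
        n_new = int(np.sum(y != minority_label)) - len(minority)
    nn = feature_space_knn(K, minority, min(k, len(minority) - 1))

    C = np.zeros((n_new, K.shape[0]))        # coefficients of synthetic samples
    for t in range(n_new):
        i = rng.integers(len(minority))      # random minority seed
        j = nn[i, rng.integers(nn.shape[1])] # one of its feature-space k-NN
        delta = rng.random()
        C[t, minority[i]] = 1.0 - delta
        C[t, minority[j]] = delta

    # Augmented Gram matrix: [[K, K C^T], [C K, C K C^T]]
    KCt = K @ C.T
    K_aug = np.block([[K, KCt], [KCt.T, C @ KCt]])
    y_aug = np.concatenate([y, np.full(n_new, minority_label)])
    return K_aug, y_aug
```

The augmented Gram matrix can then be fed to a precomputed-kernel SVM (for example, scikit-learn's SVC(kernel='precomputed')) to train the source-domain decision functions. In the online target-domain stage, the same interpolation would presumably be applied batch by batch, with each minority sample of the current batch drawing its k nearest feature-space neighbors from previously arrived batches before the target decision function is updated.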

Key words: multi-source transfer learning, online learning, imbalanced data, feature space, support vector machine (SVM), k-nearest neighbor, kernel function