Multi-source Online Transfer Learning Algorithm for Imbalanced Data

doi:10.3778/j.issn.1673-9418.2106049

Abstract

Multi-source online transfer learning uses labeled data from multiple source domains to improve classification performance in the target domain. To handle imbalanced datasets, this paper proposes a multi-source online transfer learning algorithm that oversamples in the feature space of both the source domains and the target domain. The algorithm has two parts: oversampling the multiple source domains and oversampling the online target domain. In the source-domain stage, oversampling is performed in the feature space of the support vector machine (SVM) classifier to generate minority-class samples; the new samples are obtained by augmenting the original Gram matrix using neighborhood information in the source-domain feature space. In the target-domain stage, samples arrive in batches; the minority-class samples of the current batch search the previously arrived batches for their k-nearest neighbors in the feature space, and the target-domain decision function is trained on the generated samples together with the original samples of the current batch. The kernel function maps source-domain and target-domain samples into the same feature space for oversampling, and the corresponding decision functions are trained on source-domain and target-domain data with a relatively balanced class distribution, which improves the overall performance of the algorithm. Experiments are carried out on four real datasets. On the widely used Office-Home dataset, accuracy improves by 0.0311 and G-mean by 0.0702 compared with the baseline algorithms.
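The Gram-matrix augmentation described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes an RBF kernel and assumes each synthetic sample is a convex combination, in feature space, of a minority sample and one of its k nearest minority neighbors, with neighbors found through the kernel-induced distance k(x,x) - 2k(x,y) + k(y,y). The names `rbf_gram`, `feature_space_knn`, and `augment_gram` are hypothetical.

```python
import numpy as np

def rbf_gram(X, Y, gamma=0.5):
    """RBF kernel matrix K[i, j] = exp(-gamma * ||X[i] - Y[j]||^2)."""
    sq = (X**2).sum(1)[:, None] - 2 * X @ Y.T + (Y**2).sum(1)[None, :]
    return np.exp(-gamma * np.maximum(sq, 0.0))

def feature_space_knn(K_min, k):
    """For each minority sample, return the indices of its k nearest
    minority neighbors, using squared feature-space distances computed
    from the Gram matrix alone. Requires k < number of minority samples."""
    diag = np.diag(K_min)
    d2 = diag[:, None] - 2 * K_min + diag[None, :]
    np.fill_diagonal(d2, np.inf)  # a sample is not its own neighbor
    return np.argsort(d2, axis=1)[:, :k]

def augment_gram(K, minority_idx, n_new, k=3, rng=None):
    """Return the (n + n_new) x (n + n_new) Gram matrix after adding n_new
    synthetic minority samples (assumed convex-combination scheme)."""
    rng = np.random.default_rng(rng)
    minority_idx = np.asarray(minority_idx)
    nbrs = feature_space_knn(K[np.ix_(minority_idx, minority_idx)], k)
    n = K.shape[0]
    # Row t of W holds the coefficients that express sample t as a
    # combination of the original samples in feature space.
    W = np.zeros((n + n_new, n))
    W[:n] = np.eye(n)
    for t in range(n_new):
        a = rng.integers(len(minority_idx))   # random minority sample
        b = nbrs[a, rng.integers(k)]          # one of its k neighbors
        delta = rng.random()                  # interpolation weight in [0, 1)
        W[n + t, minority_idx[a]] += 1.0 - delta
        W[n + t, minority_idx[b]] += delta
    # Inner products between all (original and synthetic) samples.
    return W @ K @ W.T

X = np.random.default_rng(0).normal(size=(10, 2))
K = rbf_gram(X, X)
K_aug = augment_gram(K, minority_idx=[0, 1, 2, 3], n_new=5, k=2, rng=1)
```

No synthetic sample is ever built in input space: only kernel values are combined, so the augmented matrix stays symmetric and positive semidefinite. A matrix like `K_aug` could then be passed to an SVM that accepts a precomputed kernel, for example scikit-learn's `SVC(kernel="precomputed")`, with the synthetic rows labeled as the minority class.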

Key words: multi-source transfer learning, online learning, imbalanced data, feature space, support vector machine (SVM), k-nearest neighbor, kernel function

ZHOU Jingyu, WANG Shitong. Multi-source Online Transfer Learning Algorithm for Imbalanced Data[J]. Journal of Frontiers of Computer Science and Technology, 2023, 17(3): 687-700.
