Journal of Frontiers of Computer Science and Technology ›› 2022, Vol. 16 ›› Issue (10): 2310-2319. DOI: 10.3778/j.issn.1673-9418.2105040

• Artificial Intelligence •

Distributed Model Reuse with Multiple Classifiers

LI Xinchun1,3, ZHAN Dechuan2,3,+

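  1. Department of Computer Science and Technology, Nanjing University, Nanjing 210023, China
  2. School of Artificial Intelligence, Nanjing University, Nanjing 210023, China
  3. State Key Laboratory for Novel Software Technology, Nanjing University, Nanjing 210023, China
  • Received: 2021-05-06 Revised: 2021-06-22 Published: 2022-10-01 Online: 2021-06-24
  • Corresponding author: + E-mail: zhandc@nju.edu.cn
  • About author: LI Xinchun, born in 1997 in Xuzhou, Jiangsu, M.S. candidate. His research interests include machine learning and data mining.
    ZHAN Dechuan, born in 1982 in Yangzhou, Jiangsu, Ph.D., professor. His research interests include machine learning and data mining.
  • Supported by:
    National Natural Science Foundation of China (61773198); National Natural Science Foundation of China (61632004)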

Distributed Model Reuse with Multiple Classifiers

LI Xinchun1,3, ZHAN Dechuan2,3,+

  1. Department of Computer Science and Technology, Nanjing University, Nanjing 210023, China
  2. School of Artificial Intelligence, Nanjing University, Nanjing 210023, China
  3. State Key Laboratory for Novel Software Technology, Nanjing University, Nanjing 210023, China
  • Received: 2021-05-06 Revised: 2021-06-22 Published: 2022-10-01 Online: 2021-06-24
  • About author: LI Xinchun, born in 1997, M.S. candidate. His research interests include machine learning and data mining.
    ZHAN Dechuan, born in 1982, Ph.D., professor. His research interests include machine learning and data mining.
  • Supported by:
    National Natural Science Foundation of China (61773198); National Natural Science Foundation of China (61632004)

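Abstract:

Traditional machine learning usually trains on centrally collected data. In practice, however, transmission overhead and privacy-protection constraints leave data increasingly scattered and isolated. Distributed training offers a way to fuse data spread across such information islands. Because decentralized data is naturally heterogeneous, local data distributions are often not independent and identically distributed (Non-IID), which poses a challenge for distributed training. First, to address the difficulty of fitting a single model to all heterogeneous clients, model reuse is introduced into distributed training and a distributed model reuse (DMR) framework is proposed. Then, theoretical analysis indicates that ensemble learning can provide an effective solution for heterogeneous data, and on this basis distributed model reuse with multiple classifiers (McDMR) is proposed. Finally, to reduce storage, computation, and transmission overhead in practical applications, two concrete optimizations are proposed: distributed model reuse with a multi-head classifier (McDMR-MH) and distributed model reuse with stochastic classifier sampling (McDMR-SC). Experiments on several public datasets verify the effectiveness of the proposed methods.

Key words: learnware, model reuse, multiple classifiers, distributed learning, ensemble, efficiency, privacy protection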

Abstract:

Traditional machine learning typically trains on centralized data, but transmission costs and data privacy constraints in many real-world applications leave data distributed and isolated. Distributed learning provides a way to fuse data across these isolated islands. However, because real-world data is naturally heterogeneous, local data is often not independent and identically distributed (Non-IID), which poses a major challenge to distributed learning. First, to overcome data heterogeneity across local clients, this paper introduces model reuse into distributed training and proposes a distributed model reuse (DMR) framework. Then, this paper shows theoretically that ensemble learning can provide a universal solution to data heterogeneity, and proposes distributed model reuse with multiple classifiers (McDMR). Finally, to reduce storage, computation, and transmission costs in practical applications, this paper further proposes two concrete variants, one based on a multi-head classifier and one based on a stochastic classifier, named McDMR-MH and McDMR-SC respectively. Experimental results on several public datasets verify the effectiveness of the proposed methods.
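The abstract does not give the architecture or the aggregation rule, so the following is only a rough illustration of the general idea of combining multiple classifiers, not the paper's algorithm. It assumes, hypothetically, that several linear classifier heads share one feature vector, that the ensemble prediction is the average of their softmax outputs, and that a cheaper variant averages over a uniformly sampled subset of heads. All names (`ensemble_predict`, `sample_heads`) and the random weights are illustrative.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def head_logits(head, features):
    """Apply one linear head (one weight row per class) to a feature vector."""
    return [sum(w * x for w, x in zip(row, features)) for row in head]


def ensemble_predict(heads, features):
    """Average the class probabilities produced by every head."""
    num_classes = len(heads[0])
    avg = [0.0] * num_classes
    for head in heads:
        probs = softmax(head_logits(head, features))
        for c, p in enumerate(probs):
            avg[c] += p / len(heads)
    return avg


def sample_heads(heads, k, rng):
    """Pick k heads uniformly without replacement (assumed sampling rule)."""
    return rng.sample(heads, k)


# Toy setup: 4 heads, 3 classes, 5-dimensional shared features (all assumed).
rng = random.Random(0)
num_heads, num_classes, dim = 4, 3, 5
heads = [[[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_classes)]
         for _ in range(num_heads)]
features = [rng.gauss(0.0, 1.0) for _ in range(dim)]

full = ensemble_predict(heads, features)                         # all heads
subset = ensemble_predict(sample_heads(heads, 2, rng), features)  # 2 sampled heads
```

Both `full` and `subset` are probability vectors over the three classes. Averaging over a sampled subset trades some ensemble diversity for less computation per prediction, which is the kind of cost reduction the abstract motivates.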

Key words: learnware, model reuse, multiple classifiers, distributed learning, ensemble, efficiency, privacy protection

CLC Number: