Journal of Frontiers of Computer Science and Technology ›› 2022, Vol. 16 ›› Issue (11): 2531-2536. DOI: 10.3778/j.issn.1673-9418.2105043

• Artificial Intelligence •

Stochastic Ensemble Policy Transfer

CHANG Tian, ZHANG Zongzhang+, YU Yang

  1. State Key Laboratory for Novel Software Technology, Nanjing University, Nanjing 210023, China
  • Received: 2021-05-12 Revised: 2021-06-29 Online: 2022-11-01 Published: 2021-06-08
  • Corresponding author: + E-mail: zzzhang@nju.edu.cn
  • About author: CHANG Tian, born in 1995 in Harbin, Heilongjiang, China, M.S. candidate. His research interest is reinforcement learning.
    ZHANG Zongzhang, born in 1985 in Xiushui, Jiangxi, China, Ph.D., associate professor. His research interests include reinforcement learning, intelligent planning and multi-agent systems.
    YU Yang, born in 1982 in Jinshan, Shanghai, China, Ph.D., professor and Ph.D. supervisor. His research interests include machine learning and reinforcement learning.
  • Supported by:
    Major Program of National Science and Technology Innovation 2030 of China for New Generation of Artificial Intelligence (2020AAA0107200); National Natural Science Foundation of China (61876119); Natural Science Foundation of Jiangsu Province (BK20181432)

Abstract:

Reinforcement learning (RL) has achieved great success on sequential decision-making problems. Along with the rapid advances of RL, transfer learning (TL) has arisen as an important technique that accelerates the learning process of RL by leveraging and transferring external knowledge. Policy transfer is a kind of transfer learning approach in which the external knowledge takes the form of teacher policies from source tasks. Existing policy transfer methods either transfer knowledge by measuring similarities between the source and target tasks, or select the best source policy by estimating the performance of source policies on the target task. However, performance estimation can sometimes be unreliable, which may lead to negative transfer. To solve this problem, this paper develops a novel policy transfer method called stochastic ensemble policy transfer (SEPT), which, instead of choosing a single policy from the source policy library, generates a teacher policy from the whole library to carry out the transfer. SEPT casts policy transfer as an option learning problem to obtain termination probabilities, computes probability weights of the source policies from these termination probabilities, and ensembles the teacher policy from the policy library according to the weights. The knowledge of the teacher policy is then transferred by policy distillation. Experimental results show that SEPT effectively accelerates RL training and outperforms other state-of-the-art policy transfer methods in both discrete and continuous action spaces.
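
To make the method description above concrete, below is a minimal sketch in PyTorch of the two transfer steps named in the abstract: ensembling a teacher policy from the source policy library via termination probabilities, and distilling the teacher into the student. It assumes discrete actions and a termination network produced by the option-learning step; the inverse weighting (a softmax over one minus the termination probabilities) and all names here (teacher_policy, termination_net, ...) are illustrative assumptions, not the authors' published implementation.

    # Illustrative sketch only; not the authors' code. Assumes each source
    # policy maps a batch of states (B, d) to action probabilities (B, |A|).
    import torch
    import torch.nn.functional as F

    def teacher_policy(states, source_policies, termination_net):
        # Option-learning step: one termination probability per source
        # policy, shape (B, K), each entry in [0, 1].
        beta = termination_net(states)
        # Assumed weighting: source policies that are less likely to
        # terminate receive larger probability weights (the exact scheme
        # is not specified in the abstract).
        weights = F.softmax(1.0 - beta, dim=-1)                    # (B, K)
        # Action distributions of the K source policies: (B, K, |A|).
        probs = torch.stack([pi(states) for pi in source_policies], dim=1)
        # Ensemble the teacher as the weighted mixture of source policies.
        return (weights.unsqueeze(-1) * probs).sum(dim=1)          # (B, |A|)

    def distillation_loss(student_logits, teacher_probs):
        # Policy distillation: KL divergence pulling the student's action
        # distribution toward the ensembled teacher's.
        log_p = F.log_softmax(student_logits, dim=-1)
        return F.kl_div(log_p, teacher_probs, reduction="batchmean")

In a typical policy-distillation setup, this loss would be added, with a weighting coefficient, to the student's own RL objective on the target task; the abstract does not specify the exact combination SEPT uses.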

Key words: transfer learning (TL), reinforcement learning (RL), policy transfer, option learning, ensemble, policy distillation
