Journal of Frontiers of Computer Science and Technology ›› 2023, Vol. 17 ›› Issue (8): 1917-1927. DOI: 10.3778/j.issn.1673-9418.2203074

• Artificial Intelligence · Pattern Recognition •


Offline Meta-Reinforcement Learning with Contrastive Prediction

HAN Xu, WU Feng   

  1. School of Computer Science and Technology, University of Science and Technology of China, Hefei 230032, China
  • Online: 2023-08-01    Published: 2023-08-01


Abstract: Traditional reinforcement learning algorithms require a large amount of online interaction with the environment to obtain a stable action-selection policy and cannot effectively adapt when the task environment changes, which makes them difficult to apply to real-world problems. Offline meta-reinforcement learning offers an effective way for an agent to adapt quickly to new tasks in complex settings by performing offline policy learning on experience replay datasets collected from multiple tasks. Applying offline meta-reinforcement learning to complex tasks faces two challenges. First, because sufficient interaction with the environment is impossible, offline reinforcement learning algorithms misestimate the value of actions outside the dataset and consequently select suboptimal actions. Second, a meta-reinforcement learning algorithm must not only learn the action-selection policy but also perform robust and efficient task inference. To address these challenges, this paper proposes an offline meta-reinforcement learning algorithm with contrastive prediction. To cope with the misestimation of the value function, the algorithm uses behavior cloning to encourage the policy to choose actions contained in the dataset. To improve the task inference capability of meta-learning, the algorithm uses a recurrent neural network to infer the task from the agent's context trajectories, and employs contrastive learning together with a prediction network to identify and distinguish the latent structure of trajectories from different tasks. Experimental results show that, compared with existing methods, agents trained with the proposed algorithm achieve scores more than 25 percentage points higher on unseen tasks, with higher meta-training efficiency and better generalization performance.
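The abstract does not give the exact form of the behavior-cloning regularizer, so the following is only an illustrative PyTorch sketch of one common way such a term is added to an offline actor update (a TD3+BC-style objective). The network sizes, the weight alpha, the scale-normalization of the Q term, and the toy critic are assumptions for illustration, not the authors' implementation.

```python
# Minimal sketch: behavior-cloning regularization keeps the offline policy
# close to actions that actually appear in the dataset.
import torch
import torch.nn as nn

class Actor(nn.Module):
    """Deterministic policy network mapping states to actions in [-1, 1]."""
    def __init__(self, state_dim: int, action_dim: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, action_dim), nn.Tanh(),
        )

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        return self.net(state)

def bc_regularized_actor_loss(actor, critic, states, dataset_actions, alpha=2.5):
    """Q-value maximization plus a behavior-cloning penalty (TD3+BC-style).

    The BC term penalizes deviation from the actions stored in the offline
    dataset, discouraging the policy from exploiting value estimates for
    actions the dataset never contains.
    """
    policy_actions = actor(states)
    q_values = critic(states, policy_actions)
    # Scale the Q term so the trade-off with the BC term is insensitive to reward scale.
    lmbda = alpha / q_values.abs().mean().detach()
    bc_loss = ((policy_actions - dataset_actions) ** 2).mean()
    return -lmbda * q_values.mean() + bc_loss

# Toy usage with a throwaway critic; in practice the critic is a learned Q-network.
critic = lambda s, a: s.sum(dim=-1, keepdim=True) - (a ** 2).sum(dim=-1, keepdim=True)
actor = Actor(state_dim=4, action_dim=2)
states = torch.randn(8, 4)
dataset_actions = torch.rand(8, 2) * 2 - 1
loss = bc_regularized_actor_loss(actor, critic, states, dataset_actions)
loss.backward()
```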

Key words: deep reinforcement learning, offline meta-reinforcement learning, contrastive learning
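The task-inference module is described only at a high level (a recurrent encoder over context trajectories, a contrastive loss, and a prediction network). Below is a minimal sketch of one plausible realization under those assumptions: a GRU context encoder, an InfoNCE-style contrastive loss that treats two segments from the same task as a positive pair, and a reward-prediction head as the auxiliary prediction target. These specific choices, dimensions, and loss weights are illustrative, not necessarily the paper's.

```python
# Minimal sketch: contrastive task inference over context trajectories.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ContextEncoder(nn.Module):
    """Encodes a context trajectory of (state, action, reward) tuples into a task embedding."""
    def __init__(self, transition_dim: int, embed_dim: int = 64, hidden: int = 128):
        super().__init__()
        self.gru = nn.GRU(transition_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, embed_dim)

    def forward(self, context: torch.Tensor) -> torch.Tensor:
        # context: (batch, seq_len, transition_dim)
        _, h_n = self.gru(context)           # final hidden state: (1, batch, hidden)
        return self.head(h_n.squeeze(0))     # task embedding: (batch, embed_dim)

def info_nce(z_a: torch.Tensor, z_b: torch.Tensor, temperature: float = 0.1) -> torch.Tensor:
    """InfoNCE loss: z_a[i] and z_b[i] come from the same task (positive pair);
    all other pairs in the batch act as negatives."""
    z_a = F.normalize(z_a, dim=-1)
    z_b = F.normalize(z_b, dim=-1)
    logits = z_a @ z_b.t() / temperature     # (batch, batch) similarity matrix
    labels = torch.arange(z_a.size(0))
    return F.cross_entropy(logits, labels)

class RewardPredictor(nn.Module):
    """Predicts the reward of a transition conditioned on the task embedding,
    forcing the embedding to carry task-identifying information."""
    def __init__(self, state_dim: int, action_dim: int, embed_dim: int, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim + embed_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, state, action, z):
        return self.net(torch.cat([state, action, z], dim=-1))

# Toy usage: two context segments per task form a positive pair for InfoNCE,
# and the prediction head is trained with plain regression on rewards.
batch, seq_len, s_dim, a_dim = 16, 20, 4, 2
encoder = ContextEncoder(transition_dim=s_dim + a_dim + 1)
predictor = RewardPredictor(s_dim, a_dim, embed_dim=64)

ctx_a = torch.randn(batch, seq_len, s_dim + a_dim + 1)   # segment 1 of each task
ctx_b = torch.randn(batch, seq_len, s_dim + a_dim + 1)   # segment 2 of the same tasks
z_a, z_b = encoder(ctx_a), encoder(ctx_b)

s, a, r = torch.randn(batch, s_dim), torch.randn(batch, a_dim), torch.randn(batch, 1)
loss = info_nce(z_a, z_b) + F.mse_loss(predictor(s, a, z_a), r)
loss.backward()
```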