[1] ARULKUMARAN K, DEISENROTH M P, BRUNDAGE M, et al. Deep reinforcement learning: a brief survey[J]. IEEE Signal Processing Magazine, 2017, 34(6): 26-38.
[2] KOBER J, BAGNELL J A, PETERS J. Reinforcement learning in robotics: a survey[J]. The International Journal of Robotics Research, 2013, 32(11): 1238-1274.
[3] SUTTON R S, BARTO A G. Reinforcement learning: an introduction[M]. Cambridge: MIT Press, 2018.
[4] LEVINE S, KUMAR A, TUCKER G, et al. Offline reinforcement learning: tutorial, review, and perspectives on open problems[J]. arXiv:2005.01643, 2020.
[5] FUJIMOTO S, MEGER D, PRECUP D. Off-policy deep reinforcement learning without exploration[C]//Proceedings of the 36th International Conference on Machine Learning, Long Beach, Jun 9-15, 2019. Cambridge: JMLR, 2019: 2052-2062.
[6] KUMAR A, ZHOU A, TUCKER G, et al. Conservative Q-learning for offline reinforcement learning[C]//Advances in Neural Information Processing Systems 33, Dec 6-12, 2020: 1179-1191.
[7] ERNST D, GEURTS P, WEHENKEL L. Tree-based batch mode reinforcement learning[J]. Journal of Machine Learning Research, 2005, 6: 503-556.
[8] WU Y, TUCKER G, NACHUM O. Behavior regularized offline reinforcement learning[J]. arXiv:1911.11361, 2019.
[9] FINN C, ABBEEL P, LEVINE S. Model-agnostic meta-learning for fast adaptation of deep networks[C]//Proceedings of the 34th International Conference on Machine Learning, Sydney, Aug 6-11, 2017. Cambridge: JMLR, 2017: 1126-1135.
[10] GUPTA A, MENDONCA R, LIU Y, et al. Meta-reinforcement learning of structured exploration strategies[C]//Advances in Neural Information Processing Systems 31, Montréal, Dec 3-8, 2018. New York: Curran Associates, 2018: 5302-5311.
[11] ROTHFUSS J, LEE D, CLAVERA I, et al. ProMP: proximal meta-policy search[C]//Proceedings of the 2019 International Conference on Learning Representations, New Orleans, May 6-9, 2019: 1-25.
[12] RAKELLY K, ZHOU A, FINN C, et al. Efficient off-policy meta-reinforcement learning via probabilistic context variables[C]//Proceedings of the 36th International Conference on Machine Learning, Long Beach, Jun 9-15, 2019: 5331-5340.
[13] LI J, VUONG Q, LIU S, et al. Multi-task batch reinforcement learning with metric learning[C]//Advances in Neural Information Processing Systems 33, Dec 6-12, 2020. New York: Curran Associates, 2020: 6197-6210.
[14] LI L, YANG R, LUO D. FOCAL: efficient fully-offline meta-reinforcement learning via distance metric learning and behavior regularization[C]//Proceedings of the 9th International Conference on Learning Representations, May 3-7, 2021: 1-11.
[15] MITCHELL E, RAFAILOV R, PENG X B, et al. Offline meta-reinforcement learning with advantage weighting[C]//Proceedings of the 38th International Conference on Machine Learning, Jul 18-24, 2021: 7780-7791.
[16] PENG X B, KUMAR A, ZHANG G, et al. Advantage-weighted regression: simple and scalable off-policy reinforcement learning[J]. arXiv:1910.00177, 2019.
[17] OORD A V, LI Y, VINYALS O. Representation learning with contrastive predictive coding[J]. arXiv:1807.03748, 2018.
[18] HE K, FAN H, WU Y, et al. Momentum contrast for unsupervised visual representation learning[C]//Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, Jun 13-19, 2020. Piscataway: IEEE, 2020: 9729-9738.
[19] LASKIN M, SRINIVAS A, ABBEEL P. CURL: contrastive unsupervised representations for reinforcement learning[C]//Proceedings of the 37th International Conference on Machine Learning, Jul 12-18, 2020: 5639-5650.
[20] FU H, TANG H, HAO J, et al. Towards effective context for meta-reinforcement learning: an approach based on contrastive learning[C]//Proceedings of the 35th AAAI Conference on Artificial Intelligence, the 33rd Conference on Innovative Applications of Artificial Intelligence, the 11th Symposium on Educational Advances in Artificial Intelligence, Feb 2-9, 2021. Palo Alto: AAAI Press, 2021: 7457-7465.
[21] FUJIMOTO S, GU S S. A minimalist approach to offline reinforcement learning[C]//Advances in Neural Information Processing Systems 34, Dec 6-14, 2021: 20132-20145.
[22] LI L, HUANG Y, CHEN M, et al. Provably improved context-based offline meta-RL with attention and contrastive learning[J]. arXiv:2102.10774, 2021.
[23] FAKOOR R, CHAUDHARI P, SOATTO S, et al. Meta-Q-Learning[C]//Proceedings of the 2020 International Conference on Learning Representations, Apr 26-May 1, 2020: 1-17.
[24] ZHOU W, PINTO L, GUPTA A. Environment probing interaction policies[C]//Proceedings of the 2019 International Conference on Learning Representations, New Orleans, May 6-9, 2019: 1-13.
[25] LEE K, SEO Y, LEE S, et al. Context-aware dynamics model for generalization in model-based reinforcement learning[C]//Proceedings of the 37th International Conference on Machine Learning, Jul 12-18, 2020: 5757-5766.
[26] CHUNG J, GULCEHRE C, CHO K H, et al. Empirical evaluation of gated recurrent neural networks on sequence modeling[J]. arXiv:1412.3555, 2014.
[27] KOSTRIKOV I, FERGUS R, TOMPSON J, et al. Offline reinforcement learning with Fisher divergence critic regularization[C]//Proceedings of the 38th International Conference on Machine Learning, Jul 18-24, 2021: 5774-5783.
[28] FUJIMOTO S, VAN HOOF H, MEGER D. Addressing function approximation error in actor-critic methods[C]//Proceedings of the 35th International Conference on Machine Learning, Stockholmsmässan, Jul 10-15, 2018: 1582-1591.
[29] TODOROV E, EREZ T, TASSA Y. MuJoCo: a physics engine for model-based control[C]//Proceedings of the 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, Vilamoura, Oct 7-11, 2012. Piscataway: IEEE, 2012: 5026-5033.
[30] HAARNOJA T, ZHOU A, ABBEEL P, et al. Soft actor-critic: off-policy maximum entropy deep reinforcement learning with a stochastic actor[C]//Proceedings of the 35th International Conference on Machine Learning, Stockholmsmässan, Jul 10-15, 2018: 1861-1870.