Journal of Frontiers of Computer Science and Technology ›› 2024, Vol. 18 ›› Issue (4): 1032-1046. DOI: 10.3778/j.issn.1673-9418.2211106

• Artificial Intelligence · Pattern Recognition •

Policy Search Reinforcement Learning Method in Latent Space

ZHAO Tingting, WANG Ying, SUN Wei, CHEN Yarui, WANG Yuan, YANG Jucheng   

  1. College of Artificial Intelligence, Tianjin University of Science and Technology, Tianjin 300457, China
  • Online:2024-04-01 Published:2024-04-01

Abstract: Policy search is an efficient learning method in deep reinforcement learning (DRL) that can solve large-scale problems with continuous state and action spaces and is widely applied to real-world problems. However, such methods usually require a large number of trajectory samples and extensive training time, and they may suffer from poor generalization, making it difficult to transfer the learned policy model to seemingly small changes in the environment. To address these problems, this paper proposes a policy search DRL method based on latent spaces. Specifically, it extends the idea of state representation learning to action representation learning, i.e., a policy is learned in the latent space of action representations, and the resulting action representations are then mapped to the real action space. By introducing representation learning models, this paper abandons the traditional end-to-end training manner in DRL and divides the whole task into two stages: large-scale representation model learning and small-scale policy model learning, where unsupervised learning methods are employed to learn the representation models and policy search methods are used to learn the small-scale policy model. The large-scale representation models preserve the required generalization ability and expressiveness, while the small-scale policy model reduces the burden of policy learning, thereby alleviating, to some extent, the issues of low sample utilization, low learning efficiency, and weak generalization of action selection in DRL. Finally, the effectiveness of introducing latent state and action representations is demonstrated on the intelligent control tasks CarRacing and Cheetah.
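
To make the two-stage pipeline in the abstract concrete, the following is a minimal PyTorch sketch of the architecture it describes: a large state-representation model, a small policy that acts entirely in latent space, and an action decoder that maps latent actions back to the real continuous action space. All module names, network sizes, and dimensions here are illustrative assumptions rather than the authors' implementation; in the paper, the representation models would first be trained with unsupervised objectives (e.g., an autoencoder-style reconstruction loss, assumed here), and only the small latent policy would then be optimized by policy search.

```python
import torch
import torch.nn as nn

# Illustrative sketch (not the authors' code): a large, unsupervised
# representation stage and a small policy operating purely in latent space.

class StateEncoder(nn.Module):
    """Large-scale representation model: maps raw observations to a compact
    latent state (assumed to be pre-trained unsupervised, e.g. as the
    encoder of an autoencoder)."""
    def __init__(self, obs_dim: int, latent_state_dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, 256), nn.ReLU(),
            nn.Linear(256, latent_state_dim),
        )

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        return self.net(obs)

class ActionDecoder(nn.Module):
    """Maps a latent action back to the real continuous action space."""
    def __init__(self, latent_action_dim: int, action_dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(latent_action_dim, 64), nn.ReLU(),
            nn.Linear(64, action_dim), nn.Tanh(),  # bounded continuous actions
        )

    def forward(self, z_a: torch.Tensor) -> torch.Tensor:
        return self.net(z_a)

class LatentPolicy(nn.Module):
    """Small-scale policy model: latent state -> latent action. Only this
    module is trained by policy search; encoder and decoder stay frozen
    after their unsupervised pre-training stage."""
    def __init__(self, latent_state_dim: int, latent_action_dim: int):
        super().__init__()
        self.net = nn.Linear(latent_state_dim, latent_action_dim)

    def forward(self, z_s: torch.Tensor) -> torch.Tensor:
        return self.net(z_s)

# Acting: observation -> latent state -> latent action -> real action.
# All dimensions below are placeholder assumptions.
encoder = StateEncoder(obs_dim=27, latent_state_dim=32)
policy = LatentPolicy(latent_state_dim=32, latent_action_dim=4)
decoder = ActionDecoder(latent_action_dim=4, action_dim=6)

obs = torch.randn(1, 27)
with torch.no_grad():
    action = decoder(policy(encoder(obs)))
print(action.shape)  # torch.Size([1, 6])
```

Note the design point the abstract emphasizes: the trainable policy is a single small linear map between latent spaces, so the policy search stage only has to optimize a fraction of the parameters, while the frozen large-scale representation models carry the expressiveness and generalization.
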

Key words: model-free reinforcement learning, policy model, state representations, action representations, continuous action space, policy search reinforcement learning method