Journal of Frontiers of Computer Science and Technology ›› 2024, Vol. 18 ›› Issue (4): 1032-1046. DOI: 10.3778/j.issn.1673-9418.2211106

• Artificial Intelligence · Pattern Recognition •

Policy Search Reinforcement Learning Method in Latent Space

ZHAO Tingting, WANG Ying, SUN Wei, CHEN Yarui, WANG Yuan, YANG Jucheng   

  1. College of Artificial Intelligence, Tianjin University of Science and Technology, Tianjin 300457, China
  • Online:2024-04-01 Published:2024-04-01

Abstract: Policy search is an efficient learning method in deep reinforcement learning (DRL) that can solve large-scale problems with continuous state and action spaces and is widely applied to real-world problems. However, such methods usually require a large number of trajectory samples and extensive training time, and they may suffer from poor generalization, making it difficult to transfer the learned policy model to seemingly small changes in the environment. To address these problems, this paper proposes a policy search DRL method based on latent spaces. Specifically, it extends the idea of state representation learning to action representation learning, i.e., a policy is learned in the latent space of action representations, and the resulting action representations are then mapped to the real action space. By introducing representation learning models, this paper abandons the traditional end-to-end training manner in DRL and divides the whole task into two stages: large-scale representation model learning and small-scale policy model learning, where unsupervised learning methods are employed to learn the representation models and policy search methods are used to learn the small-scale policy model. The large-scale representation models preserve the required generalization ability and expressiveness, while the small-scale policy model reduces the burden of policy learning, thereby alleviating, to some extent, the issues of low sample utilization, low learning efficiency, and weak generalization of action selection in DRL. Finally, the effectiveness of introducing latent state and action representations is demonstrated on the intelligent control tasks CarRacing and Cheetah.
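
To make the two-stage pipeline in the abstract concrete, the following is a minimal PyTorch sketch of the architecture it describes: a large state-representation model, a small policy that acts entirely in latent space, and an action decoder that maps latent actions back to the real continuous action space. All module names, network sizes, and dimensions here are illustrative assumptions rather than the authors' implementation; in the paper, the representation models would first be trained with unsupervised objectives (e.g., an autoencoder-style reconstruction loss, assumed here), and only the small latent policy would then be optimized by policy search.

```python
import torch
import torch.nn as nn

# Illustrative sketch (not the authors' code): a large, unsupervised
# representation stage and a small policy operating purely in latent space.

class StateEncoder(nn.Module):
    """Large-scale representation model: maps raw observations to a compact
    latent state (assumed to be pre-trained unsupervised, e.g. as the
    encoder of an autoencoder)."""
    def __init__(self, obs_dim: int, latent_state_dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, 256), nn.ReLU(),
            nn.Linear(256, latent_state_dim),
        )

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        return self.net(obs)

class ActionDecoder(nn.Module):
    """Maps a latent action back to the real continuous action space."""
    def __init__(self, latent_action_dim: int, action_dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(latent_action_dim, 64), nn.ReLU(),
            nn.Linear(64, action_dim), nn.Tanh(),  # bounded continuous actions
        )

    def forward(self, z_a: torch.Tensor) -> torch.Tensor:
        return self.net(z_a)

class LatentPolicy(nn.Module):
    """Small-scale policy model: latent state -> latent action. Only this
    module is trained by policy search; encoder and decoder stay frozen
    after their unsupervised pre-training stage."""
    def __init__(self, latent_state_dim: int, latent_action_dim: int):
        super().__init__()
        self.net = nn.Linear(latent_state_dim, latent_action_dim)

    def forward(self, z_s: torch.Tensor) -> torch.Tensor:
        return self.net(z_s)

# Acting: observation -> latent state -> latent action -> real action.
# All dimensions below are placeholder assumptions.
encoder = StateEncoder(obs_dim=27, latent_state_dim=32)
policy = LatentPolicy(latent_state_dim=32, latent_action_dim=4)
decoder = ActionDecoder(latent_action_dim=4, action_dim=6)

obs = torch.randn(1, 27)
with torch.no_grad():
    action = decoder(policy(encoder(obs)))
print(action.shape)  # torch.Size([1, 6])
```

Note the design point the abstract emphasizes: the trainable policy is a single small linear map between latent spaces, so the policy search stage only has to optimize a fraction of the parameters, while the frozen large-scale representation models carry the expressiveness and generalization.
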

Key words: model-free reinforcement learning, policy model, state representations, action representations, continuous action space, policy search reinforcement learning method