Journal of Frontiers of Computer Science and Technology ›› 2023, Vol. 17 ›› Issue (9): 2047-2074. DOI: 10.3778/j.issn.1673-9418.2211113
Review of Deep Reinforcement Learning in Latent Space
ZHAO Tingting, SUN Wei, CHEN Yarui, WANG Yuan, YANG Jucheng
Online: 2023-09-01
Published: 2023-09-01
Abstract: Deep reinforcement learning (DRL) is an effective learning paradigm for achieving general artificial intelligence and has produced remarkable results in a range of real-world applications. However, DRL suffers from poor generalization and low sample efficiency. Representation learning based on deep neural networks can effectively alleviate these problems by learning the underlying structure of the environment, and deep reinforcement learning in latent space has therefore become a mainstream approach in this field. This paper systematically reviews the research progress of latent-space representation learning in deep reinforcement learning, analyzes and summarizes existing latent-space DRL methods, and categorizes them into state representation, action representation, and dynamics models in latent space for detailed discussion. State representation in latent space is further divided into reconstruction-based methods, bisimulation-equivalence-based methods, and other methods. Finally, the paper surveys successful applications of latent-space reinforcement learning in games, intelligent control, recommendation, and other domains, and briefly discusses future development trends in this field.
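To make the reconstruction-based state representation family mentioned in the abstract more concrete, the following minimal PyTorch sketch (not taken from any surveyed method; the class names, network sizes, dummy data, and combined loss are illustrative assumptions) shows the basic pattern: an autoencoder compresses a high-dimensional observation into a low-dimensional latent state through a reconstruction loss, and the value function is learned on that latent state rather than on the raw input.

```python
# Minimal, illustrative sketch of reconstruction-based state representation
# for RL (assumed names and sizes; not an implementation from the survey).
import torch
import torch.nn as nn
import torch.nn.functional as F


class AutoEncoder(nn.Module):
    """Compresses a flattened observation into a latent state z and reconstructs it."""

    def __init__(self, obs_dim, latent_dim):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(obs_dim, 256), nn.ReLU(),
            nn.Linear(256, latent_dim),
        )
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 256), nn.ReLU(),
            nn.Linear(256, obs_dim),
        )

    def forward(self, obs):
        z = self.encoder(obs)
        return z, self.decoder(z)


class QHead(nn.Module):
    """Action-value function defined on the latent state instead of raw observations."""

    def __init__(self, latent_dim, num_actions):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(latent_dim, 128), nn.ReLU(),
            nn.Linear(128, num_actions),
        )

    def forward(self, z):
        return self.net(z)


obs_dim, latent_dim, num_actions = 84 * 84, 32, 4        # illustrative sizes
ae = AutoEncoder(obs_dim, latent_dim)
q_head = QHead(latent_dim, num_actions)
optimizer = torch.optim.Adam(
    list(ae.parameters()) + list(q_head.parameters()), lr=1e-3
)

obs_batch = torch.rand(16, obs_dim)        # dummy batch of observations
td_target = torch.rand(16, num_actions)    # stand-in for a temporal-difference target

z, recon = ae(obs_batch)
# Joint objective: reconstruct the observation and fit the value function in latent space.
loss = F.mse_loss(recon, obs_batch) + F.mse_loss(q_head(z), td_target)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```

The reconstruction term is what distinguishes this family from the bisimulation-based methods also covered in the review, which instead shape the latent space with reward and transition similarity and can discard task-irrelevant visual detail.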
ZHAO Tingting, SUN Wei, CHEN Yarui, WANG Yuan, YANG Jucheng. Review of Deep Reinforcement Learning in Latent Space[J]. Journal of Frontiers of Computer Science and Technology, 2023, 17(9): 2047-2074.