Review of Research on Multi-agent Reinforcement Learning Algorithms

doi:10.3778/j.issn.1673-9418.2401020

Abstract

Abstract: In recent years, multi-agent reinforcement learning algorithms have been widely applied in the field of artificial intelligence. This paper systematically analyzes multi-agent reinforcement learning algorithms, examines their application and progress in multi-agent systems, and surveys the related research in depth. First, it introduces the research background and development history of multi-agent reinforcement learning and summarizes existing research results. Second, it briefly reviews the application of traditional reinforcement learning algorithms to different tasks. Then, it focuses on the classification of multi-agent reinforcement learning algorithms and, organized by three main task types (path planning, pursuit-evasion games, and task allocation), reviews and analyzes their applications in multi-agent systems, the challenges involved, and the corresponding solutions. Finally, it surveys existing algorithm training environments in the multi-agent field, summarizes how deep learning has improved multi-agent reinforcement learning algorithms, identifies the challenges facing the field, and outlines future research directions.

Key words: agent, reinforcement learning, multi-agent reinforcement learning, multi-agent systems

LI Mingyang, XU Ke'er, SONG Zhiqiang, XIA Qingfeng, ZHOU Peng. Review of Research on Multi-agent Reinforcement Learning Algorithms[J]. Journal of Frontiers of Computer Science and Technology, 2024, 18(8): 1979-1997.
