Journal of Frontiers of Computer Science and Technology ›› 2024, Vol. 18 ›› Issue (8): 1979-1997. DOI: 10.3778/j.issn.1673-9418.2401020
Review of Research on Multi-agent Reinforcement Learning Algorithms
LI Mingyang, XU Ke’er, SONG Zhiqiang, XIA Qingfeng, ZHOU Peng
Online: 2024-08-01
Published: 2024-07-29
Abstract: In recent years, multi-agent reinforcement learning (MARL) algorithms have been widely applied across the field of artificial intelligence. This paper systematically analyzes MARL algorithms, examines their applications and progress in multi-agent systems, and surveys the related research in depth. It introduces the research background and development history of MARL and summarizes existing work; briefly reviews the application of traditional reinforcement learning algorithms to different tasks; focuses on the classification of MARL algorithms and, for three major task types (path planning, pursuit-evasion games, and task allocation), provides a detailed review and analysis of their applications, challenges, and solutions in multi-agent systems; and surveys the existing training environments in the multi-agent field, summarizes how deep learning has improved MARL algorithms, identifies the challenges facing the field, and outlines future research directions.
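The survey's core subject, multiple learners adapting simultaneously in a shared environment, can be made concrete with a small sketch. Below is a minimal, hypothetical example (not taken from the paper): two independent Q-learning agents in a one-dimensional pursuit-evasion toy task, one of the three task types the survey covers. All names and hyperparameters (SIZE, ALPHA, GAMMA, EPS) are illustrative assumptions, not values from the article.

```python
# Minimal illustrative sketch (not from the paper): two independent
# Q-learning agents in a 1-D pursuit-evasion toy task. All names and
# parameters are hypothetical, chosen only to show the basic MARL loop.
import random

SIZE = 10               # length of the 1-D corridor
ACTIONS = [-1, 0, 1]    # move left, stay, move right
ALPHA, GAMMA, EPS = 0.1, 0.95, 0.1

def make_q():
    # Tabular Q-function: Q[(own_pos, other_pos)][action_index]
    return {(s, o): [0.0] * len(ACTIONS) for s in range(SIZE) for o in range(SIZE)}

def choose(q, state):
    if random.random() < EPS:                       # epsilon-greedy exploration
        return random.randrange(len(ACTIONS))
    return max(range(len(ACTIONS)), key=lambda a: q[state][a])

def step(pos, action):
    # Move and clamp to the corridor boundaries
    return min(SIZE - 1, max(0, pos + ACTIONS[action]))

q_pursuer, q_evader = make_q(), make_q()
for episode in range(2000):
    p, e = 0, SIZE - 1                              # agents start at opposite ends
    for t in range(50):
        sp, se = (p, e), (e, p)                     # each agent observes both positions
        ap, ae = choose(q_pursuer, sp), choose(q_evader, se)
        p, e = step(p, ap), step(e, ae)             # simultaneous moves
        caught = (p == e)
        rp, re = (1.0, -1.0) if caught else (-0.01, 0.01)  # zero-sum-style rewards
        for q, s, a, r, s2 in ((q_pursuer, sp, ap, rp, (p, e)),
                               (q_evader, se, ae, re, (e, p))):
            # Independent Q-learning update: each agent treats the other
            # as part of the environment, which is therefore non-stationary.
            q[s][a] += ALPHA * (r + GAMMA * max(q[s2]) - q[s][a])
        if caught:
            break
```

Because each agent learns as if the other were a fixed part of the environment, the transition dynamics it experiences shift as its opponent learns; this non-stationarity is one of the central challenges of MARL that the survey discusses.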
LI Mingyang, XU Ke’er, SONG Zhiqiang, XIA Qingfeng, ZHOU Peng. Review of Research on Multi-agent Reinforcement Learning Algorithms[J]. Journal of Frontiers of Computer Science and Technology, 2024, 18(8): 1979-1997.