Multi-agent Self-organizing Cooperative Hunting in Non-convex Environment with Improved MADDPG Algorithm

doi:10.3778/j.issn.1673-9418.2310040

Abstract

Abstract: To improve the hunting efficiency of multiple agents in non-convex environments, a multi-agent reinforcement learning algorithm based on improved experience replay is proposed. A residual network (ResNet) is used to mitigate network degradation and is combined with the multi-agent deep deterministic policy gradient (MADDPG) algorithm to form the RW-MADDPG algorithm. To address the low utilization of experience pool data during multi-agent training, two methods for improving that utilization are proposed. To address agents becoming trapped inside obstacles in a non-convex obstacle environment (for example, when the target becomes unreachable), a suitable hunting reward function is designed so that the agents can complete the hunting task in such environments. Simulation experiments are designed based on this algorithm. The results show that, during training, the reward of the proposed algorithm increases faster and the hunting task is completed sooner. Compared with MADDPG, training time is reduced by 18.5% in the static hunting environment and by 49.5% in the dynamic environment. Moreover, hunting agents trained with this algorithm obtain a higher global average reward in the non-convex obstacle environment.

Key words: deep reinforcement learning, RW-MADDPG, residual network, experience pool, hunting reward function

摘要： 针对多智能体在非凸环境下的围捕效率问题，提出基于改进经验回放的多智能体强化学习算法。利用残差网络（ResNet）来改善网络退化问题，并与多智能体深度确定性策略梯度算法（MADDPG）相结合，提出了RW-MADDPG算法。为解决多智能体在训练过程中，经验池数据利用率低的问题，提出两种改善经验池数据利用率的方法；为解决多智能体在非凸障碍环境下陷入障碍物内部的情况（如陷入目标不可达等），通过设计合理的围捕奖励函数使得智能体在非凸障碍物环境下完成围捕任务。基于此算法设计仿真实验，实验结果表明，该算法在训练阶段奖励增加得更快，能更快地完成围捕任务，相比MADDPG算法静态围捕环境下训练时间缩短18.5%，动态环境下训练时间缩短49.5%，而且在非凸障碍环境下该算法训练的围捕智能体的全局平均奖励更高。

关键词: 深度强化学习, RW-MADDPG, 残差网络, 经验池, 围捕奖励函数

ZHANG Hongqiang, SHI Jiahang, WU Lianghong, WANG Xi, ZUO Cili, CHEN Zuguo, LIU Zhaohua, CHEN Lei. Multi-agent Self-organizing Cooperative Hunting in Non-convex Environment with Improved MADDPG Algorithm[J]. Journal of Frontiers of Computer Science and Technology, 2024, 18(8): 2080-2090.

张红强, 石佳航, 吴亮红, 王汐, 左词立, 陈祖国, 刘朝华, 陈磊. 改进MADDPG算法的非凸环境下多智能体自组织协同围捕[J]. 计算机科学与探索, 2024, 18(8): 2080-2090.
