计算机科学与探索 (Journal of Frontiers of Computer Science and Technology), 2024, Vol. 18, Issue 8: 2080-2090. DOI: 10.3778/j.issn.1673-9418.2310040

• Theory and Algorithms •

Multi-agent Self-organizing Cooperative Hunting in Non-convex Environment with Improved MADDPG Algorithm

ZHANG Hongqiang, SHI Jiahang, WU Lianghong, WANG Xi, ZUO Cili, CHEN Zuguo, LIU Zhaohua, CHEN Lei

  1. School of Information and Electrical Engineering, Hunan University of Science and Technology, Xiangtan, Hunan 411201, China
  • Online: 2024-08-01    Published: 2024-07-29

Multi-agent Self-organizing Cooperative Hunting in Non-convex Environment with Improved MADDPG Algorithm

ZHANG Hongqiang, SHI Jiahang, WU Lianghong, WANG Xi, ZUO Cili, CHEN Zuguo, LIU Zhaohua, CHEN Lei   

  1. School of Information and Electrical Engineering, Hunan University of Science and Technology, Xiangtan, Hunan  411201, China
  • Online: 2024-08-01    Published: 2024-07-29

Abstract: To address the hunting efficiency of multiple agents in non-convex environments, a multi-agent reinforcement learning algorithm based on improved experience replay is proposed. A residual network (ResNet) is introduced to alleviate the network degradation problem and is combined with the multi-agent deep deterministic policy gradient (MADDPG) algorithm, yielding the RW-MADDPG algorithm. To address the low utilization of experience pool data during multi-agent training, two methods for improving experience pool utilization are proposed. To prevent the agents from becoming trapped inside obstacles in non-convex obstacle environments (for example, when the target becomes unreachable), a suitable hunting reward function is designed so that the agents can complete the hunting task in such environments. Simulation experiments based on this algorithm show that its reward rises faster during training and the hunting task is completed sooner: compared with MADDPG, training time is shortened by 18.5% in the static hunting environment and by 49.5% in the dynamic environment, and the hunting agents trained with this algorithm achieve a higher global average reward in the non-convex obstacle environment.

Keywords: deep reinforcement learning, RW-MADDPG, residual network, experience pool, hunting reward function

Abstract: A multi-agent reinforcement learning algorithm based on improved experience replay is proposed to address the hunting efficiency of multiple agents in non-convex environments. A residual network (ResNet) is used to mitigate the network degradation problem and is combined with the multi-agent deep deterministic policy gradient (MADDPG) algorithm, yielding the RW-MADDPG algorithm. To address the low utilization of experience pool data during multi-agent training, two methods for improving experience pool utilization are proposed. To keep agents from becoming trapped inside obstacles in non-convex obstacle environments, such as when the target becomes unreachable, a well-designed hunting reward function enables the agents to complete the hunting task in such environments. Simulation experiments based on this algorithm show that the reward increases faster during training and the hunting task is completed sooner: compared with MADDPG, training time is shortened by 18.5% in the static hunting environment and by 49.5% in the dynamic environment, and the hunting agents trained by this algorithm achieve a higher global average reward in the non-convex obstacle environment.
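The abstract does not describe the exact network layout, so the following is only a minimal PyTorch sketch of the core idea of combining residual connections with a MADDPG actor network; the class names, layer widths, and placement of the skip connection are illustrative assumptions, not the architecture reported in the paper.

import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Two fully connected layers with an identity skip connection (ResNet-style)."""
    def __init__(self, dim: int):
        super().__init__()
        self.fc1 = nn.Linear(dim, dim)
        self.fc2 = nn.Linear(dim, dim)
        self.act = nn.ReLU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # y = x + F(x): the skip connection eases training of deeper networks
        # and counteracts the degradation problem mentioned in the abstract.
        return self.act(x + self.fc2(self.act(self.fc1(x))))

class ResidualActor(nn.Module):
    """Deterministic policy network (MADDPG actor) with a residual trunk."""
    def __init__(self, obs_dim: int, act_dim: int, hidden: int = 128):
        super().__init__()
        self.inp = nn.Linear(obs_dim, hidden)
        self.trunk = nn.Sequential(ResidualBlock(hidden), ResidualBlock(hidden))
        self.out = nn.Linear(hidden, act_dim)

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        h = torch.relu(self.inp(obs))
        return torch.tanh(self.out(self.trunk(h)))  # actions bounded to [-1, 1]

# Hypothetical usage: one hunter observing a 16-dimensional state and
# outputting a 2D velocity command.
actor = ResidualActor(obs_dim=16, act_dim=2)
action = actor(torch.randn(1, 16))

In MADDPG the centralized critic conditions on the observations and actions of all agents, so the same residual trunk could be reused there on the concatenated input; how the paper places the residual blocks in the critic is not specified in this abstract.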

Key words: deep reinforcement learning, RW-MADDPG, residual network, experience pool, hunting reward function