Journal of Frontiers of Computer Science and Technology (计算机科学与探索)



Value Function Factorization for Multi-Agent Reinforcement Learning based on Implicit Communication

DENG Yanan, WANG Qiuhong, LI Junjie, GU Jingjing

  1. College of Computer Science and Technology, Nanjing University of Aeronautics and Astronautics, Nanjing, 210000, China


Abstract: In multi-agent systems, agents can typically observe only partial state information, so each agent makes decisions without a complete understanding of the other agents' behavior and the dynamics of the environment, which increases the difficulty of collaboration. Although multi-agent reinforcement learning methods based on value function factorization have advantages in handling partial observability, cooperative uncertainty still affects multi-agent systems because of the high-dimensional state-action space and the complexity of the model structure, leading to unfair reward assignment. To address this problem, this paper proposes Value Function Factorization for Multi-Agent Reinforcement Learning based on Implicit Communication (VFRL-IC), which mitigates the impact of uncertainty by mining the local relationships between agents. First, an implicit communication framework is proposed, in which agents share their local observations with one another during training to learn local policies. Second, a global influence assessment model is built on the local observations of all agents to compute inter-agent influence values. Finally, a network structure resembling a multi-head attention mechanism is designed to fuse the inter-agent influence values and obtain local action-value models that incorporate global information. Extensive experiments in StarCraft II show that VFRL-IC outperforms the baselines by 15% to 40% in average success rate across various scenarios, with an 18% increase in efficiency.

Key words: value function factorization, multi-agent reinforcement learning, partial observability, uncertainty, implicit communication
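
The abstract describes fusing other agents' local observations into each agent's local action-value model through an attention-like structure whose weights act as inter-agent influence values. The following PyTorch sketch illustrates one plausible form of that fusion step only; the class name InfluenceFusion, the use of nn.MultiheadAttention as the attention-like component, and all dimensions are illustrative assumptions rather than the paper's actual VFRL-IC implementation.

# Minimal PyTorch sketch of an attention-style fusion of other agents' observations
# into one agent's local Q-values, under the assumptions stated above.
import torch
import torch.nn as nn
import torch.nn.functional as F


class InfluenceFusion(nn.Module):
    """Fuse other agents' observation embeddings into one agent's local Q-values.

    During centralized training, each agent is given access to the others' local
    observations (implicit communication); the attention weights play the role of
    inter-agent influence values.
    """

    def __init__(self, obs_dim: int, n_actions: int, embed_dim: int = 64, n_heads: int = 4):
        super().__init__()
        self.encoder = nn.Linear(obs_dim, embed_dim)           # shared observation encoder
        self.attn = nn.MultiheadAttention(embed_dim, n_heads, batch_first=True)
        self.q_head = nn.Sequential(                           # local action-value head
            nn.Linear(2 * embed_dim, embed_dim),
            nn.ReLU(),
            nn.Linear(embed_dim, n_actions),
        )

    def forward(self, own_obs: torch.Tensor, all_obs: torch.Tensor):
        # own_obs: (batch, obs_dim); all_obs: (batch, n_agents, obs_dim)
        own = F.relu(self.encoder(own_obs)).unsqueeze(1)       # (batch, 1, embed_dim)
        others = F.relu(self.encoder(all_obs))                 # (batch, n_agents, embed_dim)
        # Query with the agent's own embedding; keys/values come from all agents.
        fused, influence = self.attn(own, others, others)      # influence: (batch, 1, n_agents)
        q_values = self.q_head(torch.cat([own, fused], dim=-1)).squeeze(1)
        return q_values, influence.squeeze(1)


if __name__ == "__main__":
    batch, n_agents, obs_dim, n_actions = 8, 5, 30, 6
    net = InfluenceFusion(obs_dim, n_actions)
    q, w = net(torch.randn(batch, obs_dim), torch.randn(batch, n_agents, obs_dim))
    print(q.shape, w.shape)  # torch.Size([8, 6]) torch.Size([8, 5])

In a value-factorization setting such as the one the paper builds on, the per-agent Q-values produced this way would then be combined by a mixing network into a joint action value during training, while execution remains decentralized.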