Journal of Frontiers of Computer Science and Technology ›› 2024, Vol. 18 ›› Issue (8): 2169-2179. DOI: 10.3778/j.issn.1673-9418.2307034

• Artificial Intelligence · Pattern Recognition •


Offline Multi-agent Reinforcement Learning Method Based on Latent State Distribution GPT

SHENG Lei, CHEN Xiliang, LAI Jun   

  1. College of Command and Control Engineering, Army Engineering University of PLA, Nanjing 210007, China
  • Online: 2024-08-01  Published: 2024-07-29


Abstract: Offline pre-training of a foundation model with a Decision Transformer can effectively address the low sample efficiency and poor scalability of online multi-agent reinforcement learning, but this generative pre-training approach performs poorly in multi-agent tasks where individual rewards are hard to define and the dataset does not cover the optimal policy. To address this problem, the Decision Transformer is improved with a latent state distribution, and a multi-agent reinforcement learning algorithm that combines offline pre-training with online fine-tuning is proposed. The algorithm uses an autoencoder together with one-hot encoding to generate discrete latent state representations that preserve important information from the original state space. The generatively pre-trained Decision Transformer is improved through latent temporal abstraction, a technique similar to data augmentation, which alleviates to some extent the extrapolation error caused by offline datasets that do not sufficiently cover the state space. Centralized training with decentralized execution is adopted to handle the credit assignment problem among agents during online fine-tuning, and an exploration-encouraging multi-agent policy gradient algorithm is used to further search for cooperative policies in downstream tasks. Experiments on the StarCraft simulation platform show that, compared with baseline algorithms, the proposed method achieves higher scores and stronger generalization in tasks with little or even no offline trajectory data.
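
As a concrete illustration of the discrete latent state representation described in the abstract, the following is a minimal sketch, not the authors' implementation: an autoencoder whose bottleneck is forced into a one-hot code by a straight-through Gumbel-softmax and trained with a reconstruction loss, so that the discrete code retains important information from the original state space. The class name, layer sizes, codebook size, and the choice of Gumbel-softmax are all illustrative assumptions.

```python
# Minimal sketch (illustrative assumption, not the paper's code) of an
# autoencoder that maps raw states to discrete one-hot latent codes.
import torch
import torch.nn as nn
import torch.nn.functional as F

class OneHotStateAutoencoder(nn.Module):
    def __init__(self, state_dim: int, num_codes: int = 64, hidden: int = 128):
        super().__init__()
        # Encoder produces logits over a small set of discrete latent codes.
        self.encoder = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, num_codes),
        )
        # Decoder reconstructs the raw state from the one-hot code.
        self.decoder = nn.Sequential(
            nn.Linear(num_codes, hidden), nn.ReLU(),
            nn.Linear(hidden, state_dim),
        )

    def forward(self, state: torch.Tensor):
        logits = self.encoder(state)
        # hard=True yields a one-hot vector in the forward pass while the
        # backward pass uses the soft sample (straight-through estimator).
        one_hot = F.gumbel_softmax(logits, tau=1.0, hard=True)
        recon = self.decoder(one_hot)
        return one_hot, recon

# Training with a reconstruction loss encourages the discrete code to keep
# the information needed to rebuild the original state.
model = OneHotStateAutoencoder(state_dim=48)   # 48 is an assumed state size
states = torch.randn(32, 48)                   # a batch of raw states
codes, recon = model(states)
loss = F.mse_loss(recon, states)
loss.backward()
```

In the method described above, such discrete codes would then serve as the latent temporal abstraction inserted into the GPT-style trajectory sequence during offline pre-training; the sketch only covers the representation-learning step.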

Key words: offline multi-agent reinforcement learning, distributed learning, representation learning, large language model