[1] DU X, WANG J, CHEN S, et al. Multi-agent deep reinforcement learning with spatio-temporal feature fusion for traffic signal control[C]//Proceedings of the 2021 European Conference on Machine Learning and Knowledge Discovery in Databases, Applied Data Science Track, Bilbao, Sep 13-17, 2021. Cham: Springer, 2021: 470-485.
[2] LI M, QIN Z, JIAO Y, et al. Efficient ridesharing order dispatching with mean field multi-agent reinforcement learning[C]//Proceedings of the 2019 World Wide Web Conference, San Francisco, May 13-17, 2019. New York: ACM, 2019: 983-994.
[3] ZHOU M, WAN Z, WANG H, et al. MALib: a parallel framework for population-based multi-agent reinforcement learning[J]. Journal of Machine Learning Research, 2023, 24.
[4] SINGH B, KUMAR R, SINGH V P. Reinforcement learning in robotic applications: a comprehensive survey[J]. Artificial Intelligence Review, 2022, 55: 1-46.
[5] SINGLA A, RAFFERTY A N, RADANOVIC G, et al. Reinforcement learning for education: opportunities and challenges[EB/OL]. [2023-05-23]. https://arxiv.org/abs/2107.08828.
[6] LIU S, SEE K C, NGIAM K Y, et al. Reinforcement learning for clinical decision support in critical care: comprehensive review[J]. Journal of Medical Internet Research, 2020, 22(7): e18477.
[7] KIRAN B R, SOBH I, TALPAERT V, et al. Deep reinforcement learning for autonomous driving: a survey[J]. IEEE Transactions on Intelligent Transportation Systems, 2021, 23(6): 4909-4926.
[8] FUJIMOTO S, MEGER D, PRECUP D. Off-policy deep reinforcement learning without exploration[C]//Proceedings of the 36th International Conference on Machine Learning, Long Beach, Jun 9-15, 2019: 2052-2062.
[9] PENG X B, KUMAR A, ZHANG G, et al. Advantage-weighted regression: simple and scalable off-policy reinforcement learning[EB/OL]. [2023-05-23]. https://arxiv.org/abs/1910.00177.
[10] WU Y, TUCKER G, NACHUM O. Behavior regularized offline reinforcement learning[EB/OL]. [2023-05-23]. https://arxiv.org/abs/1911.11361v1.
[11] KUMAR A, ZHOU A, TUCKER G, et al. Conservative Q-learning for offline reinforcement learning[C]//Advances in Neural Information Processing Systems 33, Dec 6-12, 2020: 1179-1191.
[12] WU Y, ZHAI S, SRIVASTAVA N, et al. Uncertainty weighted actor-critic for offline reinforcement learning[C]//Proceedings of the 38th International Conference on Machine Learning, Jul 18-24, 2021: 11319-11328.
[13] YANG Y, MA X, LI C, et al. Believe what you see: implicit constraint approach for offline multi-agent reinforcement learning[C]//Advances in Neural Information Processing Systems 34, Dec 6-14, 2021: 10299-10312.
[14] WEN M, KUBA J, LIN R, et al. Multi-agent reinforcement learning is a sequence modeling problem[C]//Advances in Neural Information Processing Systems 35, New Orleans, Nov 28-Dec 9, 2022: 16509-16521.
[15] BROWN T, MANN B, RYDER N, et al. Language models are few-shot learners[C]//Advances in Neural Information Processing Systems 33, Dec 6-12, 2020: 1877-1901.
[16] CHEN M, RADFORD A, CHILD R, et al. Generative pretraining from pixels[C]//Proceedings of the 37th International Conference on Machine Learning, Jul 13-18, 2020: 1691-1703.
[17] LU K, GROVER A, ABBEEL P, et al. Pretrained transformers as universal computation engines[C]//Proceedings of the 36th AAAI Conference on Artificial Intelligence, the 34th Conference on Innovative Applications of Artificial Intelligence, the 12th Symposium on Educational Advances in Artificial Intelligence, Feb 22-Mar 1, 2022: 7628-7636.
[18] FURUTA H, MATSUO Y, GU S S. Generalized decision transformer for offline hindsight information matching[C]//Proceedings of the 10th International Conference on Learning Representations, Apr 25-29, 2022.
[19] RADFORD A, NARASIMHAN K, SALIMANS T, et al. Improving language understanding by generative pre-training[R]. San Francisco: OpenAI, 2018.
[20] DEVLIN J, CHANG M W, LEE K, et al. BERT: pre-training of deep bidirectional transformers for language understanding[C]//Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Minneapolis, Jun 2-7, 2019. Stroudsburg: ACL, 2019: 4171-4186.
[21] LIU Z, LIN Y, CAO Y, et al. Swin transformer: hierarchical vision transformer using shifted windows[C]//Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision, Montreal, Oct 10-17, 2021. Piscataway: IEEE, 2021: 10012-10022.
[22] ZHAI X, KOLESNIKOV A, HOULSBY N, et al. Scaling vision transformers[C]//Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, Jun 19-21, 2022. Piscataway: IEEE, 2022: 12104-12113.
[23] PARISOTTO E, SONG F, RAE J, et al. Stabilizing transformers for reinforcement learning[C]//Proceedings of the 37th International Conference on Machine Learning, Jul 13-18, 2020: 7487-7498.
[24] CHEN L, LU K, RAJESWARAN A, et al. Decision transformer: reinforcement learning via sequence modeling[C]//Advances in Neural Information Processing Systems 34, Dec 6-14, 2021: 15084-15097.
[25] JANNER M, LI Q, LEVINE S. Offline reinforcement learning as one big sequence modeling problem[C]//Advances in Neural Information Processing Systems 34, Dec 6-14, 2021: 1273-1286.
[26] DASARI S, GUPTA A. Transformers for one-shot visual imitation[C]//Proceedings of the 2021 Conference on Robot Learning, London, Nov 8-11, 2021: 2071-2084.
[27] ZHANG K, YANG Z, LIU H, et al. Finite-sample analysis for decentralized batch multiagent reinforcement learning with networked agents[J]. IEEE Transactions on Automatic Control, 2021, 66(12): 5925-5940.
[28] PAN L, HUANG L, MA T, et al. Plan better amid conservatism: offline multi-agent reinforcement learning with actor rectification[C]//Proceedings of the 39th International Conference on Machine Learning, Baltimore, Jul 17-23, 2022: 17221-17237.
[29] MENG L, WEN M, LE C, et al. Offline pre-trained multi-agent decision transformer[J]. Machine Intelligence Research, 2023, 20(2): 233-248.
[30] TSENG W C, WANG T H J, LIN Y C, et al. Offline multi-agent reinforcement learning with knowledge distillation[C]//Advances in Neural Information Processing Systems 35, New Orleans, Nov 28-Dec 9, 2022: 226-237.
[31] ABEL D, HERSHKOWITZ D, LITTMAN M. Near optimal behavior via approximate state abstraction[C]//Proceedings of the 2016 International Conference on Machine Learning, New York, Jun 19-24, 2016: 2915-2923.
[32] NACHUM O, GU S, LEE H, et al. Near-optimal representation learning for hierarchical reinforcement learning[C]//Proceedings of the 2018 International Conference on Learning Representations, Vancouver, Apr 30-May 3, 2018: 1-7.
[33] HAFNER D, LILLICRAP T P, NOROUZI M, et al. Mastering Atari with discrete world models[C]//Proceedings of the 9th International Conference on Learning Representations, May 3-7, 2021.
[34] LEVINE N, CHOW Y, SHU R, et al. Prediction, consistency, curvature: representation learning for locally-linear control[EB/OL]. [2023-05-23]. https://arxiv.org/abs/1909.01506.
[35] YANG M, NACHUM O. Representation matters: offline pretraining for sequential decision making[C]//Proceedings of the 38th International Conference on Machine Learning, Jul 18-24, 2021: 11784-11794.
[36] STOOKE A, LEE K, ABBEEL P, et al. Decoupling representation learning from reinforcement learning[C]//Proceedings of the 38th International Conference on Machine Learning, Jul 18-24, 2021: 9870-9879.
[37] KUMAR A, HONG J, SINGH A, et al. Should I run offline reinforcement learning or behavioral cloning?[C]//Proceedings of the 10th International Conference on Learning Representations, Apr 25-29, 2022: 15-51.