[1] MNIH V, KAVUKCUOGLU K, SILVER D, et al. Human-level control through deep reinforcement learning[J]. Nature, 2015, 518(7540): 529-533.
[2] 王扬, 陈智斌, 吴兆蕊, 等. 强化学习求解组合最优化问题的研究综述[J]. 计算机科学与探索, 2022, 16(2): 261-279.
WANG Y, CHEN Z B, WU Z R, et al. Review of reinforcement learning for combinatorial optimization problem[J]. Journal of Frontiers of Computer Science and Technology, 2022, 16(2): 261-279.
[3] KUNG T H, CHEATHAM M, MEDENILLA A, et al. Performance of ChatGPT on USMLE: potential for AI-assisted medical education using large language models[J]. PLoS Digital Health, 2023, 2(2): e0000198.
[4] BOMMARITO M J, KATZ D M. GPT takes the bar exam[J]. arXiv:2212.14402, 2022.
[5] OUYANG L, WU J, JIANG X, et al. Training language models to follow instructions with human feedback[J]. arXiv:2203.02155, 2022.
[6] HU Y, WANG W, JIA H, et al. Learning to utilize shaping rewards: a new approach of reward shaping[J]. arXiv:2011.02669, 2020.
[7] SOVIANY P, IONESCU R T, ROTA P, et al. Curriculum learning: a survey[J]. International Journal of Computer Vision, 2022, 130(6): 1526-1565.
[8] NAM T, SUN S H, PERTSCH K, et al. Skill-based meta-reinforcement learning[J]. arXiv:2204.11828, 2022.
[9] 韩旭, 吴锋. 结合对比预测的离线元强化学习方法[J]. 计算机科学与探索, 2023, 17(8): 1917-1927.
HAN X, WU F. Offline meta-reinforcement learning with contrastive prediction[J]. Journal of Frontiers of Computer Science and Technology, 2023, 17(8): 1917-1927.
[10] ZHOU F, CAO C. Overcoming catastrophic forgetting in graph neural networks with experience replay[C]//Proceedings of the 2021 AAAI Conference on Artificial Intelligence. Menlo Park: AAAI, 2021: 4714-4722.
[11] SAGLAM B, MUTLU F B, CICEK D C, et al. Actor prioritized experience replay[J]. arXiv:2209.00532, 2022.
[12] ANDRYCHOWICZ M, WOLSKI F, RAY A, et al. Hindsight experience replay[C]//Proceedings of the 31st International Conference on Neural Information Processing Systems. New York: ACM, 2017: 5055-5065.
[13] HE Q, ZHUANG L, LI H. Soft hindsight experience replay[J]. arXiv:2002.02089, 2020.
[14] LIU H, TROTT A, SOCHER R, et al. Competitive experience replay[J]. arXiv:1902.00528, 2019.
[15] NGUYEN H, LA H M, DEANS M C. Hindsight experience replay with experience ranking[C]//Proceedings of the 2019 Joint IEEE 9th International Conference on Development and Learning and Epigenetic Robotics. Piscataway: IEEE, 2019: 1-6.
[16] FANG M, ZHOU T, DU Y, et al. Curriculum-guided hindsight experience replay[C]//Advances in Neural Information Processing Systems 32, Vancouver, Dec 8-14, 2019: 12602-12613.
[17] SCHRAMM L, DENG Y, GRANADOS E, et al. USHER: unbiased sampling for hindsight experience replay[J]. arXiv:2207.01115, 2022.
[18] LUU T M, YOO C D. Hindsight goal ranking on replay buffer for sparse reward environment[J]. IEEE Access, 2021, 9: 51996-52007.
[19] LILLICRAP T P, HUNT J J, PRITZEL A, et al. Continuous control with deep reinforcement learning[C]//Proceedings of the 4th International Conference on Learning Representations, San Juan, May 2-4, 2016.
[20] ZHANG J, HE T, SRA S, et al. Why gradient clipping accelerates training: a theoretical justification for adaptivity[C]//Proceedings of the 8th International Conference on Learning Representations, Addis Ababa, Apr 26-30, 2020.
[21] PATHAK D, AGRAWAL P, EFROS A A, et al. Curiosity-driven exploration by self-supervised prediction[C]//Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition. Washington: IEEE Computer Society, 2017: 488-489.
[22] LEVINE S, KUMAR A, TUCKER G, et al. Offline reinforcement learning: tutorial, review, and perspectives on open problems[J]. arXiv:2005.01643, 2020.