Journal of Frontiers of Computer Science and Technology ›› 2024, Vol. 18 ›› Issue (5): 1223-1231. DOI: 10.3778/j.issn.1673-9418.2303031

• Theory and Algorithm •


Self-competitive Hindsight Experience Replay with Penalty Measures

WANG Zihao, QIAN Xuezhong, SONG Wei   

  1. School of Artificial Intelligence and Computer Science, Jiangnan University, Wuxi, Jiangsu 214122, China
  • Online: 2024-05-01  Published: 2024-04-29


Abstract: Self-competitive hindsight experience replay (SCHER) is an improved strategy built on the hindsight experience replay (HER) algorithm. When environmental rewards are sparse, HER optimizes the model by replaying past experience and generating virtual labeled data from it. However, HER has two problems: first, it cannot handle the large amount of repetitive data the agent produces under sparse rewards, and this invalid data contaminates the experience pool; second, the virtual goals may be randomly chosen from intermediate states that do not help complete the task, leading to learning bias. To address these issues, SCHER introduces two improvements: first, an adaptive reward signal is added to penalize meaningless actions, so that the agent quickly learns to avoid such operations; second, a self-competition strategy generates two different sets of trajectories for the same task through competition, and comparative analysis of these trajectories identifies the key steps that allow the agent to succeed in different environments, improving the accuracy of the generated virtual goals. Experimental results show that the SCHER algorithm makes better use of experience replay, raising the average task success rate by 5.7 percentage points, with higher accuracy and better generalization ability.
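
The abstract describes the two mechanisms only at a high level, so the following Python sketch is an illustration rather than the authors' implementation: the penalty rule, the distance thresholds, and the helper names (sparse_reward, penalized_reward, her_relabel, key_steps) are assumptions made for demonstration. It shows HER-style goal relabelling combined with an extra penalty for transitions that make no progress, and a simple self-competition heuristic that compares two rollouts of the same task to pick candidate virtual goals.

```python
import numpy as np

def sparse_reward(achieved_goal, goal, tol=0.05):
    # Standard sparse goal-reaching reward: 0 if the goal is reached, -1 otherwise.
    return 0.0 if np.linalg.norm(achieved_goal - goal) < tol else -1.0

def penalized_reward(prev_achieved, achieved, goal, penalty=-0.5, eps=1e-3):
    # Sparse reward plus an adaptive penalty for "meaningless" actions,
    # approximated here (an assumption) as transitions that leave the
    # achieved goal essentially unchanged.
    r = sparse_reward(achieved, goal)
    if np.linalg.norm(achieved - prev_achieved) < eps:
        r += penalty
    return r

def her_relabel(episode, goal, k=4, rng=None):
    # episode: list of (obs, action, achieved_goal) arrays for one rollout.
    # Returns (obs, action, next_obs, goal, reward) tuples: one per transition
    # with the original goal, plus up to k copies relabelled with virtual goals
    # drawn from the agent's own future achieved states (HER "future" strategy).
    rng = rng or np.random.default_rng()
    transitions = []
    for t in range(len(episode) - 1):
        obs, act, ag = episode[t]
        next_obs, _, next_ag = episode[t + 1]
        transitions.append(
            (obs, act, next_obs, goal, penalized_reward(ag, next_ag, goal)))
        future = rng.integers(t + 1, len(episode),
                              size=min(k, len(episode) - t - 1))
        for f in future:
            virtual_goal = episode[f][2]
            transitions.append(
                (obs, act, next_obs, virtual_goal,
                 penalized_reward(ag, next_ag, virtual_goal)))
    return transitions

def key_steps(better, worse, tol=0.05):
    # Self-competition heuristic (an assumption): given two rollouts of the
    # same task, keep the achieved goals that the more successful rollout
    # visited but the weaker one never came close to, and prefer them when
    # sampling virtual goals.
    worse_ags = np.stack([ag for _, _, ag in worse])
    keys = []
    for _, _, ag in better:
        if np.min(np.linalg.norm(worse_ags - ag, axis=1)) > tol:
            keys.append(ag)
    return keys

# Toy usage with 2-D goals:
rng = np.random.default_rng(0)
episode = [(np.zeros(4), np.zeros(2), rng.random(2)) for _ in range(10)]
batch = her_relabel(episode, goal=np.array([0.9, 0.9]), k=2, rng=rng)
```

The key_steps heuristic simply treats achieved goals visited only by the better rollout as candidate "key steps"; a real implementation would define progress, success, and the comparison rule according to the specific environment and the paper's full method.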

Key words: deep reinforcement learning, sparse reward, experience replay, adaptive reward signal