Journal of Frontiers of Computer Science and Technology

• Science Researches •     Next Articles

Formal Verification of Spatio-Temporal Rules Guided Safe Reinforcement Learning for CPS

YIN Chan, ZHU Yi, WANG Jinyong, CHEN Xiaoying, HAO Guosheng   

  1. 1. College of computer science and technology, Jiangsu Normal University, Xuzhou, Jiangsu 221116, China
    2. School of information engineering, Xuzhou Institute of Technology, Xuzhou, Jiangsu 221116, China
    3. Department of computer science and technology, Nanjing University, Nanjing 210023, China

面向CPS时空规则验证制导的安全强化学习

印婵, 祝义, 王金永, 陈小颖, 郝国生   

  1. 1. 江苏师范大学 计算机科学与技术学院, 江苏 徐州 221116
    2. 徐州工程学院 信息工程学院, 江苏 徐州 221018
    3. 南京大学 计算机科学与技术系, 南京 210023

Abstract: Deep reinforcement learning is currently a commonly used method in decision-making for Cyber Physical System (CPS). However, when facing an unknown environment and dealing with complex tasks, deep reinforcement learning based on black boxes cannot guarantee the security of the system and the interpretability of reward function settings. To address the above issues, a formalized spatio-temporal rule verification-guided safe reinforcement learning method is proposed. Firstly, the Combination-Space Rule Timed Communicating Sequential Process (CSR-TCSP) is proposed to model the system. Then it is validated by Failure Divergence Refinement (FDR) which is a model checker combined with the Spatio-Temporal Specification Language (STSL). Secondly, the structure of the reward state machine is formalized by abstracting the system environment model to propose the Spatio-Temporal Rule Reward Machine (STR-RM) which can guide the setting of reward functions in reinforcement learning. In addition, to monitor system operation and ensure the safety of output decisions, a monitor and a safe action decision-making algorithm are designed to obtain a more secure state-action strategy. Finally, the effectiveness of the proposed method is demonstrated through an example of obstacle avoidance and lane-changing overtaking in the autonomous driving system.

Key words: cyber physical system, formal method, process algebra, safe reinforcement learning, autonomous driving

摘要: 深度强化学习是目前信息物理融合系统(Cyber Physical System, CPS)决策中常用的一种方法。然而,当面对未知环境和复杂任务时,基于黑盒的深度强化学习方法在系统的安全性和奖励函数设置的可解释性方面存在不足。针对上述问题,提出了一种形式化时空规则验证制导的安全强化学习方法。首先,提出了时空规则通信顺序进程(Combination-Space Rule Timed Communicating Sequential Process, CSR-TCSP)对系统进行建模,并结合时空规约语言(Spatio-Temporal Specification Language, STSL)和模型检测工具FDR(Failure Divergence Refinement)对进程代数模型进行验证。其次,利用系统环境模型形式化奖励状态机的结构,提出了时空规则奖励状态机(Spatio-Temporal Rule Reward Machine, STR-RM)以指导强化学习中奖励函数的设置。此外,为了监测系统的运行并确保输出决策的安全性,设计了一个监控器及安全动作决策算法以获得更安全的状态行为策略。最后,通过一个自动驾驶系统中的避障与变道超车实例,证明本文所提方法的有效性。

关键词: 信息物理融合系统, 形式化方法, 进程代数, 安全强化学习, 自动驾驶