Formal Verification of Spatio-Temporal Rules Guided Safe Reinforcement Learning for CPS

doi:10.3778/j.issn.1673-9418.2312010

Abstract

Deep reinforcement learning (DRL) is widely used for decision-making in cyber-physical systems (CPS). However, in unknown environments and on complex tasks, black-box DRL can guarantee neither the safety of the system nor the interpretability of its reward function design. To address these issues, this paper proposes a safe reinforcement learning method guided by the formal verification of spatio-temporal rules. First, the combination-space rule timed communicating sequential process (CSR-TCSP) is proposed to model the system, and the resulting model is verified with the model checker FDR (Failures-Divergences Refinement) in combination with the spatio-temporal specification language (STSL). Second, the structure of the reward machine is formalized by abstracting the system environment model, yielding the spatio-temporal rule reward machine (STR-RM), which guides the design of reward functions in reinforcement learning. In addition, to monitor system operation and ensure the safety of output decisions, a monitor and a safe action decision-making algorithm are designed to obtain a safer state-action policy. Finally, the effectiveness of the proposed method is demonstrated on a case study of obstacle avoidance and lane-change overtaking in an autonomous driving system.

Key words: cyber physical system, formal method, process algebra, safe reinforcement learning, autonomous driving
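
The reward-machine and monitor ideas named in the abstract can be illustrated with a toy sketch. It is not the paper's STR-RM or its safe action decision-making algorithm, whose definitions are not given here. The state names, event names, reward values, the 10 m gap threshold, and the `brake` fallback are all hypothetical, chosen only to show the general pattern: a finite-state machine assigns rewards to high-level events, and a monitor vetoes unsafe actions before they are executed.

```python
from dataclasses import dataclass, field


@dataclass
class RewardMachine:
    """Finite-state machine mapping (state, event) to (next state, reward)."""
    initial: str
    transitions: dict  # (state, event) -> (next_state, reward)
    state: str = field(init=False)

    def __post_init__(self):
        self.state = self.initial

    def step(self, event: str) -> float:
        # Unlisted (state, event) pairs keep the current state and give zero reward.
        next_state, reward = self.transitions.get(
            (self.state, event), (self.state, 0.0)
        )
        self.state = next_state
        return reward


def make_overtake_rm() -> RewardMachine:
    """Hypothetical reward machine for a three-stage overtaking task."""
    return RewardMachine(
        initial="follow",
        transitions={
            ("follow", "left_lane_entered"): ("passing", 1.0),
            ("passing", "lead_vehicle_passed"): ("returning", 1.0),
            ("returning", "right_lane_entered"): ("done", 10.0),
            ("follow", "collision"): ("failed", -100.0),
            ("passing", "collision"): ("failed", -100.0),
            ("returning", "collision"): ("failed", -100.0),
        },
    )


def make_monitor(left_gap_m: float, min_gap_m: float = 10.0):
    """Return a predicate that rejects a left lane change when the gap is too small."""
    def is_safe(action: str) -> bool:
        if action == "change_left":
            return left_gap_m >= min_gap_m
        return True
    return is_safe


def safe_action(q_values: dict, is_safe, fallback: str = "brake") -> str:
    """Pick the highest-valued action that the monitor accepts."""
    for action in sorted(q_values, key=q_values.get, reverse=True):
        if is_safe(action):
            return action
    return fallback  # assumed conservative default when every action is rejected
```

In this sketch the learner would call `safe_action` on its value estimates at every step and feed the observed events to `RewardMachine.step` to obtain the reward, so the task structure stays readable as a small transition table rather than being buried in a hand-tuned scalar reward.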

YIN Chan, ZHU Yi, WANG Jinyong, CHEN Xiaoying, HAO Guosheng. Formal Verification of Spatio-Temporal Rules Guided Safe Reinforcement Learning for CPS[J]. Journal of Frontiers of Computer Science and Technology, 2025, 19(2): 513-527.

印婵, 祝义, 王金永, 陈小颖, 郝国生. 面向CPS时空规则验证制导的安全强化学习[J]. 计算机科学与探索, 2025, 19(2): 513-527.
