计算机科学与探索 (Journal of Frontiers of Computer Science and Technology) ›› 2024, Vol. 18 ›› Issue (2): 378-386. DOI: 10.3778/j.issn.1673-9418.2210090

• Theory · Algorithm •


Strategy Selection and Outcome Evaluation of Three-Way Decisions Based on Reinforcement Learning

LIU Xiaoxue, JIANG Chunmao   

  1. School of Computer Science and Information Engineering, Harbin Normal University, Harbin 150025, China
  2. School of Computer Science and Mathematics, Fujian University of Technology, Fuzhou 350118, China
  • Online: 2024-02-01    Published: 2024-02-01


Abstract: The trisecting-acting-outcome (TAO) model of three-way decision (3WD) consists of three steps: trisecting a whole, devising action strategies, and evaluating outcomes. Existing research on outcome evaluation measures how outcomes change before and after a strategy is applied, but it cannot yet predict which strategy will achieve the greatest effect. To address this gap, this paper focuses on the “acting” and “outcome” steps of the TAO model and proposes a method, based on Q-learning in reinforcement learning, for strategy selection and outcome prediction in the change-based three-way decision. Firstly, the altered tri-partitions and the strategies in the TAO model of the change-based three-way decision are treated as the states and actions in reinforcement learning, respectively, and the process of obtaining a new altered tri-partition after each applied strategy is regarded as one cycle. The reward produced by each cycle is computed with cumulative prospect theory, and the interaction between the agent and the environment is modeled as a Markov decision process. Secondly, a target reward is set, and the state at which the cumulative reward of the cycles reaches the target reward is taken as the terminal state of the Markov decision process. Then the Q-learning algorithm is used to derive a strategy sequence that reaches the target reward within the fewest cycles, and this strategy sequence is used to predict the future utility of the current altered tri-partition. Finally, an example illustrates the applicability and effectiveness of the method.

Key words: three-way decision, change-based three-way decision, reinforcement learning, strategy selection, outcome evaluation
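
The following Python sketch illustrates, at a very high level, the workflow summarized in the abstract: altered tri-partitions act as states, strategies act as actions, a simplified prospect-theory value function stands in for the cumulative-prospect-theory reward of each cycle, an episode terminates once the cumulative reward reaches a preset target, and the learned Q-table is read out as a strategy sequence. All quantities below (the number of states and strategies, the transition table, the outcome values, the value-function parameters, and the target reward) are hypothetical placeholders for illustration only and do not reproduce the model or the experiments of the paper.

```python
# Minimal, illustrative sketch only; not the authors' implementation.
import numpy as np

# Hypothetical altered tri-partition states (indices) and strategies (actions).
N_STATES, N_ACTIONS = 5, 3
rng = np.random.default_rng(0)
# Hypothetical deterministic transitions: applying strategy a in state s yields a new tri-partition.
TRANSITIONS = rng.integers(0, N_STATES, size=(N_STATES, N_ACTIONS))
# Hypothetical raw gains/losses produced by each (state, strategy) pair.
OUTCOMES = rng.normal(loc=0.5, scale=1.0, size=(N_STATES, N_ACTIONS))

def cpt_value(x, alpha=0.88, beta=0.88, lam=2.25):
    """Prospect-theory value function: concave for gains, convex and steeper for losses."""
    return x ** alpha if x >= 0 else -lam * (-x) ** beta

TARGET_REWARD = 2.0                          # terminate an episode once cumulative reward reaches this target
ALPHA, GAMMA, EPS = 0.1, 0.9, 0.1            # learning rate, discount factor, exploration rate
Q = np.zeros((N_STATES, N_ACTIONS))

for episode in range(500):
    s, cumulative = 0, 0.0
    for _ in range(50):                      # one iteration = one "cycle" (apply one strategy)
        a = rng.integers(N_ACTIONS) if rng.random() < EPS else int(np.argmax(Q[s]))
        s_next = int(TRANSITIONS[s, a])
        r = cpt_value(OUTCOMES[s, a])        # per-cycle reward via the prospect-theory value function
        cumulative += r
        done = cumulative >= TARGET_REWARD   # the target reward defines the MDP's terminal state
        target = r if done else r + GAMMA * np.max(Q[s_next])
        Q[s, a] += ALPHA * (target - Q[s, a])
        s = s_next
        if done:
            break

# Greedy strategy sequence from the learned Q-table: which strategies to apply, in what order,
# to reach the target reward in few cycles, plus the predicted utility of the starting tri-partition.
s, plan, total = 0, [], 0.0
while total < TARGET_REWARD and len(plan) < 20:
    a = int(np.argmax(Q[s]))
    plan.append(a)
    total += cpt_value(OUTCOMES[s, a])
    s = int(TRANSITIONS[s, a])
print("strategy sequence:", plan, "predicted utility:", round(total, 3))
```

Reading the greedy policy out of the Q-table until the target reward is met mirrors the idea of predicting, from the current altered tri-partition, which strategies to apply and what future utility to expect.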