Journal of Frontiers of Computer Science and Technology ›› 2024, Vol. 18 ›› Issue (6): 1476-1490. DOI: 10.3778/j.issn.1673-9418.2310026

• Theory·Algorithm •

Interpretable Machine Learning Algorithm Based on Rules Ensemble and Its Application

MIN Jiyuan, LU Tongyu, REN Tingting, CHEN Ruhao   

  1. College of Economics and Management, China Jiliang University, Hangzhou 310018, China
  2. School of Cyber Science and Engineering, Southeast University, Nanjing 211189, China
  • Online: 2024-06-01 Published: 2024-05-31

Abstract: Machine learning algorithms have achieved great success thanks to their strong predictive performance, but their applicability is limited in fields that demand highly interpretable models. To address this lack of interpretability, a new interpretable machine learning algorithm based on the idea of rule ensembles is proposed, called ensemble trees penalized logistic rule regression. It achieves predictive performance comparable to that of ensemble tree algorithms with lower structural complexity, while retaining the interpretability of logistic regression. First, branches are extracted from ensemble trees such as random forest and XGBoost and converted into logic rules. Second, the rule set is pruned and deduplicated to obtain a compact rule set. Finally, the rules are incorporated into logistic regression as variables, and model complexity is controlled with the Lasso. Taking enterprise risk early warning as a case study, the algorithm is compared experimentally with multiple machine learning algorithms. The results show that it inherits the default discrimination ability of the ensemble trees and outperforms most of the compared machine learning algorithms on every classification metric; in addition, its rules give thresholds for enterprise risk indicators, which facilitates enterprise risk management. Furthermore, an enterprise credit score is constructed from the algorithm, which demonstrates its wide applicability: the resulting scores conform to objective regularities and are discriminative. The robustness of the model's predictive performance is further verified on three public datasets.

Key words: interpretable machine learning, rule learning, nonlinear regression, ensemble trees, risk early warning
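
The three steps named in the abstract (branch extraction, pruning and deduplication, Lasso-penalized logistic regression on rule variables) can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation. The toy trees, the feature names `debt_ratio` and `current_ratio`, the thresholds, and the sample data are all hypothetical. Pruning is reduced here to keeping the tightest bound per feature, and the L1 fit uses a simple proximal gradient solver, which is an assumption, since the abstract does not specify either procedure.

```python
import math

# Hypothetical toy trees: an internal node is (feature, threshold, left, right),
# where the left child is taken when x[feature] <= threshold; leaves are "leaf".
TREE_A = ("debt_ratio", 0.6, "leaf", ("current_ratio", 1.0, "leaf", "leaf"))
TREE_B = ("debt_ratio", 0.6, "leaf", ("debt_ratio", 0.8, "leaf", "leaf"))

def extract_rules(node, path=()):
    """Yield every root-to-leaf branch as a tuple of (feature, op, threshold)."""
    if node == "leaf":
        if path:
            yield path
        return
    feature, threshold, left, right = node
    yield from extract_rules(left, path + ((feature, "<=", threshold),))
    yield from extract_rules(right, path + ((feature, ">", threshold),))

def simplify(rule):
    """Keep the tightest bound per (feature, op); sort to get a canonical form."""
    bounds = {}
    for feature, op, threshold in rule:
        key = (feature, op)
        if key not in bounds:
            bounds[key] = threshold
        elif op == "<=":
            bounds[key] = min(bounds[key], threshold)
        else:
            bounds[key] = max(bounds[key], threshold)
    return tuple(sorted((f, op, t) for (f, op), t in bounds.items()))

def build_rule_set(trees):
    """Extract, simplify, and deduplicate rules, preserving first-seen order."""
    seen, rules = set(), []
    for tree in trees:
        for rule in extract_rules(tree):
            canonical = simplify(rule)
            if canonical not in seen:
                seen.add(canonical)
                rules.append(canonical)
    return rules

def rule_fires(rule, x):
    """Return 1 if sample x (a dict) satisfies every condition of the rule."""
    return int(all(x[f] <= t if op == "<=" else x[f] > t for f, op, t in rule))

def fit_l1_logistic(X, y, lam=0.05, lr=0.5, epochs=2000):
    """L1-penalized logistic regression fitted by proximal gradient descent."""
    n, p = len(X), len(X[0])
    w, b = [0.0] * p, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * p, 0.0
        for xi, yi in zip(X, y):
            z = b + sum(wj * xj for wj, xj in zip(w, xi))
            err = 1.0 / (1.0 + math.exp(-z)) - yi
            grad_b += err / n
            for j in range(p):
                grad_w[j] += err * xi[j] / n
        b -= lr * grad_b  # the intercept is not penalized
        for j in range(p):
            v = w[j] - lr * grad_w[j]
            # Soft-thresholding sets small coefficients exactly to zero.
            w[j] = math.copysign(max(abs(v) - lr * lam, 0.0), v)
    return w, b

# Hypothetical samples: label 1 marks a defaulting enterprise.
samples = [
    {"debt_ratio": 0.3, "current_ratio": 1.5},
    {"debt_ratio": 0.5, "current_ratio": 0.8},
    {"debt_ratio": 0.7, "current_ratio": 0.7},
    {"debt_ratio": 0.7, "current_ratio": 1.4},
    {"debt_ratio": 0.9, "current_ratio": 1.2},
    {"debt_ratio": 0.9, "current_ratio": 0.6},
]
labels = [0, 0, 1, 0, 1, 1]

rules = build_rule_set([TREE_A, TREE_B])
X = [[rule_fires(r, x) for r in rules] for x in samples]  # rules as 0/1 variables
w, b = fit_l1_logistic(X, labels)

for rule, coef in zip(rules, w):
    text = " AND ".join(f"{f} {op} {t}" for f, op, t in rule)
    print(f"{coef:+.3f}  {text}")
```

Each printed line pairs a rule with its fitted coefficient, so a rule such as `debt_ratio > 0.8` reads directly as a risk threshold. In practice the branches would come from a trained random forest or XGBoost model rather than hand-written tuples, and a library Lasso solver would replace the loop above.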