Review of Fault-Tolerant Technologies for Large-Scale DNN Training Scenarios
XU Guangyuan, ZHANG Yaqiang, SHI Hongzhi
Journal of Frontiers of Computer Science and Technology. 2025, (7): 1771-1788. DOI: 10.3778/j.issn.1673-9418.2406096