
计算机科学与探索 (Journal of Frontiers of Computer Science and Technology) ›› 2025, Vol. 19 ›› Issue (7): 1771-1788. DOI: 10.3778/j.issn.1673-9418.2406096
Review of Fault-Tolerant Technologies for Large-Scale DNN Training Scenarios
XU Guangyuan, ZHANG Yaqiang, SHI Hongzhi (许光远, 张亚强, 史宏志)
Online: 2025-07-01
Published: 2025-06-30
Abstract: Large-scale computing clusters built from massive heterogeneous resources are indispensable for training large deep neural networks (DNN) such as ChatGPT and Sora. However, the failure rate during training is often positively correlated with the training scale, and the failure of any single component in the cluster can interrupt the training job, so achieving efficient fault tolerance in large-scale DNN training has become particularly important. This paper provides an in-depth review of recent fault-tolerance techniques for large-scale neural network training, focusing on how failures arising during training can be handled effectively at different levels and on the potential strengths and weaknesses of the related techniques. It explains the key role that fault tolerance plays in large-scale DNN training, surveys recent fault-tolerance techniques for such training, and, according to the entity they act on, organizes the discussion into two levels: the training process, and the training architecture and its modules. Fault-tolerance designs within the training process mainly comprise checkpoint-based and non-checkpoint techniques. Checkpoint-based techniques aim to optimize the storage and transfer of checkpoints to reduce data loss and recovery time when failures occur; non-checkpoint techniques rely on elastic training, redundant computation, and parameter-update strategies to provide more flexible failure-recovery mechanisms. At the level of training architecture and modules, the paper examines recent fault-tolerance techniques, such as failure detection and management, that keep large-scale clusters stable, and discusses fault-tolerance measures for modules such as containers, data preprocessing, and matrix multiplication to improve the efficiency and resilience of training jobs. By summarizing and analyzing recently proposed fault-tolerance techniques, the paper identifies the existing challenges in large-scale deep learning training scenarios, looks ahead to future development directions, and proposes corresponding optimization directions to meet the fault-tolerance requirements of even larger-scale deep learning training scenarios.
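To make the checkpoint-based recovery path described above concrete, the following is a minimal sketch of periodic checkpointing with resume-on-restart, the basic mechanism that the checkpoint-oriented techniques surveyed in this paper further optimize. It is not drawn from any specific system covered by the survey; PyTorch is assumed, and the file name, save interval, model, and training loop are all hypothetical placeholders.

```python
# Minimal periodic-checkpointing sketch (PyTorch assumed; all names are illustrative).
import os
import torch
import torch.nn as nn

CKPT_PATH = "ckpt.pt"      # hypothetical checkpoint file
CKPT_INTERVAL = 100        # illustrative save frequency, in steps

model = nn.Linear(512, 10)                                  # stand-in for a large DNN
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
start_step = 0

# On (re)start, resume from the latest checkpoint if a previous run was interrupted.
if os.path.exists(CKPT_PATH):
    state = torch.load(CKPT_PATH, map_location="cpu")
    model.load_state_dict(state["model"])
    optimizer.load_state_dict(state["optimizer"])
    start_step = state["step"] + 1

for step in range(start_step, 1000):
    x = torch.randn(32, 512)          # placeholder batch
    loss = model(x).sum()             # placeholder loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    if step % CKPT_INTERVAL == 0:
        # Write to a temporary file, then rename atomically, so a crash mid-save
        # cannot corrupt the previous checkpoint; at most CKPT_INTERVAL steps are lost.
        tmp_path = CKPT_PATH + ".tmp"
        torch.save({"model": model.state_dict(),
                    "optimizer": optimizer.state_dict(),
                    "step": step}, tmp_path)
        os.replace(tmp_path, CKPT_PATH)
```

The synchronous save above stalls training for the duration of the write; the checkpoint storage and transfer optimizations discussed in the survey target exactly this cost, for example by writing asynchronously or keeping copies in memory rather than only on persistent storage.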
XU Guangyuan, ZHANG Yaqiang, SHI Hongzhi. Review of Fault-Tolerant Technologies for Large-Scale DNN Training Scenarios[J]. Journal of Frontiers of Computer Science and Technology, 2025, 19(7): 1771-1788.