
Journal of Frontiers of Computer Science and Technology, 2025, Vol. 19, Issue (7): 1771-1788. DOI: 10.3778/j.issn.1673-9418.2406096
• Frontiers · Surveys •
Review of Fault-Tolerant Technologies for Large-Scale DNN Training Scenarios
XU Guangyuan, ZHANG Yaqiang, SHI Hongzhi
Online: 2025-07-01
Published: 2025-06-30
XU Guangyuan, ZHANG Yaqiang, SHI Hongzhi. Review of Fault-Tolerant Technologies for Large-Scale DNN Training Scenarios[J]. Journal of Frontiers of Computer Science and Technology, 2025, 19(7): 1771-1788.
URL: http://fcst.ceaj.org/EN/10.3778/j.issn.1673-9418.2406096