
Journal of Frontiers of Computer Science and Technology, 2025, Vol. 19, Issue (7): 1771-1788. DOI: 10.3778/j.issn.1673-9418.2406096
• Frontiers · Surveys •
Review of Fault-Tolerant Technologies for Large-Scale DNN Training Scenarios
XU Guangyuan, ZHANG Yaqiang, SHI Hongzhi
Online: 2025-07-01
Published: 2025-06-30
XU Guangyuan, ZHANG Yaqiang, SHI Hongzhi. Review of Fault-Tolerant Technologies for Large-Scale DNN Training Scenarios[J]. Journal of Frontiers of Computer Science and Technology, 2025, 19(7): 1771-1788.
URL: http://fcst.ceaj.org/EN/10.3778/j.issn.1673-9418.2406096