Journal of Frontiers of Computer Science and Technology

• Science Researches •

A review of fault-tolerant technologies for large-scale DNN training scenarios

XU Guangyuan,  ZHANG Yaqiang,  SHI Hongzhi   

  1. Shandong Massive Information Technology Research Institute, Jinan  250101, China

Abstract: Large-scale computing clusters composed of heterogeneous resources are essential for training large deep neural networks such as ChatGPT and Sora. However, the failure rate during training tends to increase with the training scale, and a failure in any component of the cluster may interrupt the training task, making efficient fault-tolerance mechanisms crucial in large-scale deep neural network training. To this end, this paper surveys fault-tolerance technologies developed in recent years for large-scale neural network training, focusing on how failures that arise during training can be handled effectively at different levels, as well as the potential advantages and limitations of these technologies.
This paper first explains the critical role of fault-tolerance techniques in large-scale deep neural network training, and then reviews recent advances in these techniques, which are categorized by their focus into two levels: fault tolerance in the training process, and fault tolerance in the system architecture and its modules. For the training process, the paper covers both checkpoint-based and non-checkpoint-based techniques. Checkpoint-based fault tolerance aims to optimize the storage and transmission of checkpoints so as to minimize data loss and recovery time when a failure occurs, whereas non-checkpoint techniques rely on elastic training, redundant computation, and parameter-update strategies to provide more flexible failure recovery. For the system architecture and its modules, the paper examines recent fault detection and management technologies that keep large-scale clusters stable, and further discusses fault-tolerance measures for modules such as containers, data preprocessing, and matrix multiplication, which improve the efficiency and fault-tolerance capability of training tasks.
Finally, the paper summarizes and analyzes the fault-tolerance technologies proposed in recent years, outlines the remaining challenges in large-scale deep learning training, and discusses directions for future optimization to meet the fault-tolerance demands of even larger-scale training scenarios.
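To make the checkpoint-based approach described above concrete, the following is a minimal sketch of periodic checkpointing and recovery in a PyTorch-style training loop. It is illustrative only and not taken from any of the surveyed systems; the file name, interval, and helper functions are assumptions, and real large-scale systems additionally shard, stream, or asynchronously persist checkpoints to reduce storage and transmission overhead.

```python
# Minimal periodic-checkpointing sketch (illustrative only): save model and
# optimizer state every few steps so that, after a failure, training can
# resume from the latest checkpoint instead of restarting from scratch.
import os
import torch

CKPT_PATH = "ckpt_latest.pt"  # hypothetical path; real systems shard/stream this


def save_checkpoint(step, model, optimizer):
    # Write to a temporary file first, then atomically rename, so a crash
    # mid-write cannot corrupt the previous valid checkpoint.
    tmp_path = CKPT_PATH + ".tmp"
    torch.save(
        {"step": step,
         "model": model.state_dict(),
         "optimizer": optimizer.state_dict()},
        tmp_path,
    )
    os.replace(tmp_path, CKPT_PATH)


def load_checkpoint(model, optimizer):
    # Return the step to resume from (0 if no checkpoint exists yet).
    if not os.path.exists(CKPT_PATH):
        return 0
    state = torch.load(CKPT_PATH, map_location="cpu")
    model.load_state_dict(state["model"])
    optimizer.load_state_dict(state["optimizer"])
    return state["step"] + 1


def train(model, optimizer, data_loader, total_steps, ckpt_interval=100):
    start = load_checkpoint(model, optimizer)
    for step, batch in enumerate(data_loader):
        if step < start:            # naive replay; real systems checkpoint the data loader too
            continue
        if step >= total_steps:
            break
        loss = model(batch).mean()  # placeholder loss for illustration
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        if step % ckpt_interval == 0:
            save_checkpoint(step, model, optimizer)
```

The checkpoint interval embodies the trade-off the abstract refers to: shorter intervals reduce the amount of lost work on failure but increase storage and transmission cost, which is precisely what the surveyed checkpoint-optimization techniques aim to mitigate.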

Key words: fault-tolerance, deep learning, model training, ChatGPT, large model
