Journal of Frontiers of Computer Science and Technology

• Science Researches •

A review of fault-tolerant technologies for large-scale DNN training scenarios

XU Guangyuan,  ZHANG Yaqiang,  SHI Hongzhi   

  1. Shandong Massive Information Technology Research Institute, Jinan  250101, China

Abstract: Large-scale computing clusters composed of heterogeneous resources are essential for training large deep neural networks such as ChatGPT and Sora. However, the failure rate during training tends to increase with the training scale, and a failure in any component of the cluster may interrupt the training task, making efficient fault-tolerance mechanisms crucial in large-scale deep neural network training. To this end, this paper surveys fault-tolerance technologies developed in recent years for large-scale neural network training, focusing on how failures that arise during training can be handled effectively at different levels, as well as the potential advantages and limitations of these technologies.
This paper first explains the critical role of fault-tolerance techniques in large-scale deep neural network training, and then reviews recent advances in these techniques, which are categorized by their focus into two levels: fault tolerance in the training process, and fault tolerance in the system architecture and its modules. For the training process, the paper covers both checkpoint-based and non-checkpoint-based techniques. Checkpoint-based fault tolerance aims to optimize the storage and transmission of checkpoints so as to minimize data loss and recovery time when a failure occurs, whereas non-checkpoint techniques rely on elastic training, redundant computation, and parameter-update strategies to provide more flexible failure recovery. For the system architecture and its modules, the paper examines recent fault detection and management technologies that keep large-scale clusters stable, and further discusses fault-tolerance measures for modules such as containers, data preprocessing, and matrix multiplication, which improve the efficiency and fault-tolerance capability of training tasks.
Finally, the paper summarizes and analyzes the fault-tolerance technologies proposed in recent years, outlines the remaining challenges in large-scale deep learning training, and discusses directions for future optimization to meet the fault-tolerance demands of even larger-scale training scenarios.
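To make the checkpoint-based approach described above concrete, the following is a minimal sketch of periodic checkpointing and recovery in a PyTorch-style training loop. It is illustrative only and not taken from any of the surveyed systems; the file name, interval, and helper functions are assumptions, and real large-scale systems additionally shard, stream, or asynchronously persist checkpoints to reduce storage and transmission overhead.

```python
# Minimal periodic-checkpointing sketch (illustrative only): save model and
# optimizer state every few steps so that, after a failure, training can
# resume from the latest checkpoint instead of restarting from scratch.
import os
import torch

CKPT_PATH = "ckpt_latest.pt"  # hypothetical path; real systems shard/stream this


def save_checkpoint(step, model, optimizer):
    # Write to a temporary file first, then atomically rename, so a crash
    # mid-write cannot corrupt the previous valid checkpoint.
    tmp_path = CKPT_PATH + ".tmp"
    torch.save(
        {"step": step,
         "model": model.state_dict(),
         "optimizer": optimizer.state_dict()},
        tmp_path,
    )
    os.replace(tmp_path, CKPT_PATH)


def load_checkpoint(model, optimizer):
    # Return the step to resume from (0 if no checkpoint exists yet).
    if not os.path.exists(CKPT_PATH):
        return 0
    state = torch.load(CKPT_PATH, map_location="cpu")
    model.load_state_dict(state["model"])
    optimizer.load_state_dict(state["optimizer"])
    return state["step"] + 1


def train(model, optimizer, data_loader, total_steps, ckpt_interval=100):
    start = load_checkpoint(model, optimizer)
    for step, batch in enumerate(data_loader):
        if step < start:            # naive replay; real systems checkpoint the data loader too
            continue
        if step >= total_steps:
            break
        loss = model(batch).mean()  # placeholder loss for illustration
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        if step % ckpt_interval == 0:
            save_checkpoint(step, model, optimizer)
```

The checkpoint interval embodies the trade-off the abstract refers to: shorter intervals reduce the amount of lost work on failure but increase storage and transmission cost, which is precisely what the surveyed checkpoint-optimization techniques aim to mitigate.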

Key words: fault-tolerance, deep learning, model training, ChatGPT, large model
