A review of fault-tolerant technologies for large-scale DNN training scenarios
XU Guangyuan, ZHANG Yaqiang, SHI Hongzhi
Journal of Frontiers of Computer Science and Technology. 0, (): 1-21. DOI: 10.3778/j.issn.1673-9418.2406096