[1] ABADI M, BARHAM P, CHEN J, et al. TensorFlow: a system for large-scale machine learning[C]//Proceedings of the 12th USENIX Symposium on Operating Systems Design and Implementation, Savannah, Nov 2-4, 2016: 265-283.
[2] PASZKE A, GROSS S, MASSA F, et al. PyTorch: an imperative style, high-performance deep learning library[C]//Advances in Neural Information Processing Systems 32, Vancouver, Dec 8-14, 2019: 8024-8035.
[3] CHEN T, LI M, LI Y, et al. MXNet: a flexible and efficient machine learning library for heterogeneous distributed systems[J]. arXiv:1512.01274, 2015.
[4] Baidu. PaddlePaddle, GitHub[EB/OL]. [2022-06-27]. https://github.com/PaddlePaddle/Paddle.
[5] LI M, LIU Y, LIU X, et al. The deep learning compiler: a comprehensive survey[J]. IEEE Transactions on Parallel and Distributed Systems, 2021, 32(3): 708-727.
[6] YANG B, ZHANG J, LI J, et al. PipeMare: asynchronous pipeline parallel DNN training[C]//Proceedings of Machine Learning and Systems 2021, Apr 5-9, 2021: 269-296.
[7] WANG G, WANG K, JIANG K, et al. Wavelet: efficient DNN training with tick-tock scheduling[C]//Proceedings of Machine Learning and Systems 2021, Apr 5-9, 2021: 696-710.
[8] LI S, HOEFLER T. Chimera: efficiently training large-scale neural networks with bidirectional pipelines[C]//Proceedings of the 2021 International Conference for High Performance Computing, Networking, Storage and Analysis, St. Louis, Nov 14-19, 2021: 27.
[9] HUANG T, LIN D L, LIN C X, et al. Taskflow: a general-purpose parallel and heterogeneous task programming system[J]. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 2022, 41(5): 1448-1452.
[10] NARAYANAN D, SANTHANAM K, KAZHAMIAKA F, et al. Heterogeneity-aware cluster scheduling policies for deep learning workloads[C]//Proceedings of the 14th USENIX Symposium on Operating Systems Design and Implementation, Nov 4-6, 2020: 481-498.
[11] Google XLA Team. XLA: TensorFlow, compiled[EB/OL]. Google Developers Blog. [2022-06-24]. https://developers.googleblog.com/2017/03/xla-tensorflow-compiled.html.
[12] CHEN T, MOREAU T, JIANG Z, et al. TVM: an automated end-to-end optimizing compiler for deep learning[C]//Proceedings of the 13th USENIX Symposium on Operating Systems Design and Implementation, Carlsbad, Oct 8-10, 2018: 578-594.
[13] ROTEM N, FIX J, ABDULRASOOL S, et al. Glow: graph lowering compiler techniques for neural networks[J]. arXiv:1805.00907, 2018.
[14] CYPHERS S, BANSAL A K, BHIWANDIWALLA A, et al. Intel nGraph: an intermediate representation, compiler, and executor for deep learning[J]. arXiv:1801.08058, 2018.
[15] Intel. OpenVINO toolkit[EB/OL]. [2022-06-24]. https://software.intel.com/content/www/us/en/develop/tools/openvino-toolkit.html.
[16] ROESCH J, LYUBOMIRSKY S, WEBER L, et al. Relay: a new IR for machine learning frameworks[C]//Proceedings of the 2nd ACM SIGPLAN International Workshop on Machine Learning and Programming Languages, Philadelphia, Jun 18-22, 2018: 58-68.
[17] VASILACHE N, ZINENKO O, THEODORIDIS T, et al. Tensor comprehensions: framework-agnostic high performance machine learning abstractions[J]. arXiv:1802.04730, 2018.
[18] MA L, XIE Z, YANG Z, et al. Rammer: enabling holistic deep learning compiler optimizations with rTasks[C]//Proceedings of the 14th USENIX Symposium on Operating Systems Design and Implementation, Nov 4-6, 2020: 881-897.
[19] NIU W, GUAN J, WANG Y, et al. DNNFusion: accelerating deep neural networks execution with advanced operator fusion[C]//Proceedings of the 42nd ACM SIGPLAN International Conference on Programming Language Design and Implementation, Canada, Jun 20-25, 2021: 883-898.
[20] UNGER C, JIA Z H, WU W, et al. Unity: accelerating DNN training through joint optimization of algebraic transformations and parallelization[C]//Proceedings of the 16th USENIX Symposium on Operating Systems Design and Implementation, Carlsbad, Jul 11-13, 2022: 267-284.
[21] ZHAO T, HALL M, JOHANSEN H, et al. Improving communication by optimizing on-node data movement with data layout[C]//Proceedings of the 26th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, Feb 27-Mar 3, 2021: 304-317.
[22] YUAN J, LI X Q, CHENG C, et al. OneFlow: redesign the distributed deep learning framework from scratch[J]. arXiv:2110.15032, 2021.
[23] XIAO W, BHARDWAJ R, RAMJEE R, et al. Gandiva: introspective cluster scheduling for deep learning[C]//Proceedings of the 13th USENIX Symposium on Operating Systems Design and Implementation, Carlsbad, Oct 8-10, 2018: 595-610.
[24] WANG S, GONZALEZ O J, ZHOU X, et al. An efficient and non-intrusive GPU scheduling framework for deep learning training systems[C]//Proceedings of the 2020 International Conference for High Performance Computing, Networking, Storage and Analysis, Atlanta, Nov 9-19, 2020: 90.
[25] HUANG Y, CHENG Y, BAPNA A, et al. GPipe: efficient training of giant neural networks using pipeline parallelism[C]//Advances in Neural Information Processing Systems 32, Vancouver, Dec 8-14, 2019: 103-112.
[26] FAN S, RONG Y, MENG C, et al. DAPPLE: a pipelined data parallel approach for training large models[C]//Proceedings of the 26th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, Feb 27-Mar 3, 2021: 431-445.
[27] 常爽爽, 赵栩锋, 刘震宇, 等. 基于异构多核的多类型DAG任务的响应时间分析[J]. 计算机学报, 2020, 43(6): 1052-1068.
CHANG S S, ZHAO X F, LIU Z Y, et al. Response time analysis of typed DAG tasks on heterogeneous multi-cores[J]. Chinese Journal of Computers, 2020, 43(6): 1052-1068.
[28] JIA Z, PADON O, THOMAS J, et al. TASO: optimizing deep learning computation with automatic generation of graph substitutions[C]//Proceedings of the 27th ACM Symposium on Operating Systems Principles, Huntsville, Oct 27-30, 2019: 47-62.