Journal of Frontiers of Computer Science and Technology ›› 2024, Vol. 18 ›› Issue (1): 111-126. DOI: 10.3778/j.issn.1673-9418.2209026

• Theory and Algorithm •

Deep Learning Compiler Load Balancing Optimization Method for Model Training

WANG Li, GAO Kai, ZHAO Yaqian, LI Rengang, CAO Fang, GUO Zhenhua   

  1. State Key Laboratory of High-End Server & Storage Technology, Inspur Electronic Information Industry Co., Ltd., Jinan 250000, China
  • Online: 2024-01-01  Published: 2024-01-01

Abstract: For compute-intensive artificial intelligence (AI) training applications, the computational graph is more complex, so data loading, the partitioning of the computational graph into tasks, and the load balance of task scheduling become key factors affecting computing performance. To bring task scheduling for model training in deep learning compilers to a load-balanced state, this paper proposes three optimization methods for the computational graph. Firstly, load balance between the CPU and the back-end computing devices is achieved by automatically building an efficient pipeline between data loading and model training, which improves the overall energy efficiency of the system. Secondly, a layered optimization technique for the computational graph achieves load balance when the graph is scheduled for execution on the back-end devices. Finally, the resource utilization of the back-end devices is improved by automatically building an efficient pipeline between layers. Experimental results show that the proposed methods achieve system-wide load balance while training tasks are automatically mapped to the underlying hardware devices. Compared with traditional deep learning frameworks and compilers such as TensorFlow and nGraph, the task scheduling load balancing optimizations deliver a 2% to 10% performance improvement in the training of different models and reduce the overall power consumption of the training system by more than 10%.
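
The first method, overlapping data loading with model training, can be pictured as a bounded producer-consumer pipeline between the CPU and the back-end device. The sketch below only illustrates that idea under assumed placeholder names (load_batch, train_step, PREFETCH_DEPTH and so on); it is not the compiler implementation described in the paper.

```python
# Minimal sketch (assumption, not the paper's implementation): overlap CPU-side
# data loading with device-side training steps via a bounded prefetch queue, so
# the CPU and the back-end device stay busy at the same time.
import queue
import threading
import time

NUM_BATCHES = 8          # hypothetical number of batches per epoch
PREFETCH_DEPTH = 2       # how many batches the CPU may load ahead of the device

def load_batch(i):
    """Stand-in for CPU-side data loading and preprocessing."""
    time.sleep(0.01)     # simulate I/O + preprocessing cost
    return f"batch-{i}"

def train_step(batch):
    """Stand-in for a training step executed on the back-end device."""
    time.sleep(0.02)     # simulate device compute time
    return f"trained on {batch}"

def producer(q):
    # CPU thread: keeps loading batches ahead of the consumer; the bounded
    # queue keeps loading throughput balanced with device throughput.
    for i in range(NUM_BATCHES):
        q.put(load_batch(i))
    q.put(None)          # sentinel: no more data

def consumer(q):
    # "Device" thread: consumes batches as soon as they are ready, so loading
    # of batch i+1 overlaps with training on batch i.
    while True:
        batch = q.get()
        if batch is None:
            break
        print(train_step(batch))

if __name__ == "__main__":
    q = queue.Queue(maxsize=PREFETCH_DEPTH)
    t = threading.Thread(target=producer, args=(q,), daemon=True)
    t.start()
    consumer(q)
    t.join()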
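The second method, the layered optimization of the computational graph, amounts to grouping the graph's layers so that the estimated cost per back-end device is as even as possible before scheduling. The greedy partition below is a minimal sketch of that balancing idea under assumed inputs (per-layer cost estimates, a fixed device count); the paper's actual partitioning strategy is not spelled out in the abstract.

```python
# Minimal sketch (assumption, not the paper's algorithm): greedily assign
# layers to the currently least-loaded device so per-device cost stays balanced.
import heapq

def balance_layers(layer_costs, num_devices):
    """Return one group of layer indices per device with roughly equal total cost."""
    # Min-heap of (accumulated cost, device id): the next layer always goes to
    # the least-loaded device.
    heap = [(0.0, d) for d in range(num_devices)]
    heapq.heapify(heap)
    groups = [[] for _ in range(num_devices)]
    # Visiting layers in descending cost order improves the greedy balance.
    for idx in sorted(range(len(layer_costs)), key=lambda i: -layer_costs[i]):
        cost, dev = heapq.heappop(heap)
        groups[dev].append(idx)
        heapq.heappush(heap, (cost + layer_costs[idx], dev))
    return groups

if __name__ == "__main__":
    # Hypothetical per-layer cost estimates (e.g., FLOPs or measured latency).
    costs = [4.0, 1.0, 3.0, 2.5, 0.5, 2.0]
    print(balance_layers(costs, num_devices=2))
```

The third method, the inter-layer pipeline, reuses the same bounded-queue pattern as the first sketch, with the output of one layer group feeding the next group instead of the data loader feeding the device.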

Key words: model training, compiler optimization, load balancing, hierarchical scheduling, automatic pipelining