Journal of Frontiers of Computer Science and Technology ›› 2018, Vol. 12 ›› Issue (10): 1645-1657. DOI: 10.3778/j.issn.1673-9418.1711010

Analyzing Performance of Neural Networks in Training Phase

LI Jingjun+, ZHANG Chen, CAO Qiang   

  1. Key Laboratory of Information Storage System, Ministry of Education of China, Wuhan National Laboratory for Optoelectronics, Huazhong University of Science and Technology, Wuhan 430074, China
  • Online: 2018-10-01 Published: 2018-10-08

面向训练阶段的神经网络性能分析

李景军+,张  宸,曹  强   

  1. 华中科技大学 武汉光电国家研究中心,武汉 430074

Abstract: Neural networks have recently been deployed in an increasing number of fields. As network models grow more complex, graphics processing units (GPUs) have been adopted for deep learning. Although GPUs perform very well at accelerating matrix multiplication, their compute and memory resources are not fully utilized during the compute-intensive training phase, because network models are complex and diverse. This paper presents an experimental, fine-grained performance analysis of deep neural network training. First, it divides the training phase into six stages from the perspective of data flow and measures the latency of each stage. Second, it quantitatively analyzes the GPU compute efficiency and resource utilization of each layer from three perspectives: GPU-accelerated libraries, neural network models, and batch size. Finally, it quantifies the weights and feature maps of each layer to reveal GPU memory utilization. The experiments show that (1) the compute efficiency of cuDNN in convolution layers is twice that of cuBLAS; (2) the resource utilization of convolution layers is 50% higher than that of fully connected layers; and (3) GPU memory utilization varies across layers and is low overall, never exceeding 20% of the total memory space.

Key words: network models, graphics processing unit (GPU), resource utilization, compute efficiency, data flow, GPU-accelerated library

摘要: 最近,神经网络被广泛应用到许多领域。然而,随着神经网络模型越来越复杂,图形处理单元(graphics processing unit,GPU)被应用到深度学习中。GPU在加速矩阵计算方面展现出了卓越的性能,但是多样和复杂的神经网络模型导致网络训练阶段GPU的计算资源和显存并没有充分利用。对神经网络训练阶段进行细粒度的性能分析。首先从数据流的角度把训练过程分解为6个阶段,并测试每个阶段的延时;然后从GPU加速库、神经网络模型和批次三方面量化分析每一层的GPU计算效率和资源利用率;最后分析每层的参数和特征图的显存占用情况。实验发现:(1)cuDNN库卷积的计算效率是cuBLAS库的2倍。(2)卷积层的资源利用率比全连接层高50%。(3)不同层的显存利用率差异很大,整体利用率不高,最大不超过显存的20%。

关键词: 网络模型, 图形处理单元(GPU), 资源利用率, 计算效率, 数据流, GPU加速库
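
The per-layer memory accounting mentioned in the abstract (weights plus feature maps, compared against total GPU memory) can be illustrated with a minimal Python sketch. It is not the authors' measurement code. It assumes 32-bit floats, counts only weights and forward feature maps (no gradients, optimizer state, or library workspace), and uses a hypothetical two-layer network and a hypothetical 12 GiB device.

```python
from dataclasses import dataclass

BYTES_PER_FLOAT32 = 4  # assumption: all tensors are stored as 32-bit floats


@dataclass
class ConvLayer:
    """Hypothetical description of one convolution layer."""
    name: str
    in_channels: int
    out_channels: int
    kernel: int      # square kernel side length
    out_height: int  # height of the output feature map
    out_width: int   # width of the output feature map

    def weight_bytes(self) -> int:
        # Kernel weights plus one bias per output channel.
        params = self.out_channels * (self.in_channels * self.kernel ** 2 + 1)
        return params * BYTES_PER_FLOAT32

    def feature_map_bytes(self, batch_size: int) -> int:
        # One output feature map per sample in the batch.
        elements = batch_size * self.out_channels * self.out_height * self.out_width
        return elements * BYTES_PER_FLOAT32


def memory_utilization(layers, batch_size: int, total_gpu_bytes: int) -> float:
    """Fraction of GPU memory taken by weights and feature maps of all layers."""
    used = sum(l.weight_bytes() + l.feature_map_bytes(batch_size) for l in layers)
    return used / total_gpu_bytes


# Hypothetical layers and device size, chosen only for illustration.
layers = [
    ConvLayer("conv1", in_channels=3, out_channels=64, kernel=3,
              out_height=224, out_width=224),
    ConvLayer("conv2", in_channels=64, out_channels=128, kernel=3,
              out_height=112, out_width=112),
]
total_gpu_bytes = 12 * 1024 ** 3

for layer in layers:
    print(layer.name, layer.weight_bytes(), layer.feature_map_bytes(batch_size=32))
print(f"utilization: {memory_utilization(layers, 32, total_gpu_bytes):.2%}")
```

Because the feature-map term scales with batch size and the weight term does not, a breakdown like this shows how the split between the two changes from layer to layer and from one batch size to another.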