Journal of Frontiers of Computer Science and Technology (计算机科学与探索) ›› 2020, Vol. 14 ›› Issue (8): 1288-1297. DOI: 10.3778/j.issn.1673-9418.1909072

• High Performance Computing •

Parallelization and Optimization of Application for Phonon BTE

WEN Minhua (文敏华), LIU Yongzhi (刘永志), BAO Hua (鲍华), HU Yue (胡跃), SHEN Yongxing (沈泳星), WEI Jianwen (韦建文), LIN Xinhua (林新华)

  1. Center for High Performance Computing, Shanghai Jiao Tong University, Shanghai 200240, China
  2. University of Michigan-Shanghai Jiao Tong University Joint Institute, Shanghai Jiao Tong University, Shanghai 200240, China
  • Online: 2020-08-01 Published: 2020-08-07

Abstract:

Heat conduction at the submicron scale can be predicted effectively with the phonon Boltzmann transport equation (BTE). Compared with stochastic methods, deterministic methods, represented by the finite volume method, are considered more promising for solving practical engineering problems with the phonon BTE. However, solving the BTE with the finite volume method requires many iteration steps and a long iteration time. To address this, this paper proposes a GPU parallel acceleration scheme for the iterative solution part of the phonon BTE, designs a suitable thread allocation scheme and data storage format, and applies optimizations such as loop unrolling and kernel fusion to the iteration process. In addition, a multi-GPU version of the phonon BTE solver is implemented with a parallel strategy based on angular direction, using MPI+CUDA, CUDA-Aware MPI, and NCCL (NVIDIA Collective Communications Library). Experimental results show that the single-GPU version on one V100 is up to 31.5 times faster than the serial version on an Intel Xeon Gold 6248. The multi-GPU version using NCCL reaches up to 83% parallel efficiency on 8 DGX-2 nodes with a total of 128 V100 GPUs, which is 57% higher than the MPI+CUDA version.

Key words: parallel acceleration, Boltzmann transport equation (BTE), DGX-2, compute unified device architecture (CUDA)