Parallelization and Optimization of Application for Phonon BTE

doi:10.3778/j.issn.1673-9418.1909072

Abstract

Heat conduction at the submicron scale can be predicted effectively using the Boltzmann transport equation (BTE) for phonons. Compared with stochastic methods, deterministic methods, represented by the finite volume method for the phonon BTE, are considered more promising for solving practical engineering problems. However, the finite volume method requires a large number of iteration steps and a long iteration time. To address this, this paper proposes a GPU parallel acceleration scheme for the iterative solution part of the phonon BTE, and designs an appropriate thread allocation method and data storage format. Loop unrolling and kernel fusion are also applied to optimize the iteration process. In addition, a multi-GPU version of the phonon BTE solver is implemented using a direction-based parallel strategy, with three communication approaches: MPI+CUDA, CUDA-Aware MPI, and NCCL (NVIDIA Collective Communications Library). Experimental results show that the single-GPU version on a V100 is up to 31.5 times faster than the serial implementation on an Intel Xeon Gold 6248. The multi-GPU version with NCCL achieves up to 83% parallel efficiency on 8 DGX-2 nodes with a total of 128 V100 GPUs, which is 57% higher than the parallel version using MPI+CUDA.

Key words: parallel acceleration, Boltzmann transport equation (BTE), DGX-2, compute unified device architecture (CUDA)
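
The direction-based parallel strategy named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the contiguous block distribution, and the in-process stand-in for a sum all-reduce are all assumptions made for illustration. The sketch assumes that each device owns a block of angular directions, computes a weighted partial sum of its directional intensities, and that the partial sums are then combined across devices (the role a collective such as an MPI or NCCL all-reduce would typically play).

```python
from typing import List, Sequence


def partition_directions(n_directions: int, n_ranks: int) -> List[range]:
    """Split direction indices 0..n_directions-1 into contiguous blocks.

    Hypothetical helper: the first (n_directions % n_ranks) ranks each get
    one extra direction so that block sizes differ by at most one.
    """
    base, extra = divmod(n_directions, n_ranks)
    blocks = []
    start = 0
    for rank in range(n_ranks):
        size = base + (1 if rank < extra else 0)
        blocks.append(range(start, start + size))
        start += size
    return blocks


def local_partial_sum(intensity: Sequence[float],
                      weights: Sequence[float],
                      block: range) -> float:
    """Weighted sum over the directions owned by one rank."""
    return sum(weights[d] * intensity[d] for d in block)


def all_reduce_sum(partials: Sequence[float]) -> List[float]:
    """In-process stand-in for a sum all-reduce: every rank gets the total."""
    total = sum(partials)
    return [total] * len(partials)


def direction_sum_on_all_ranks(intensity: Sequence[float],
                               weights: Sequence[float],
                               n_ranks: int) -> List[float]:
    """Partition directions, reduce locally, then combine across ranks."""
    blocks = partition_directions(len(intensity), n_ranks)
    partials = [local_partial_sum(intensity, weights, b) for b in blocks]
    return all_reduce_sum(partials)
```

In a real multi-GPU code the local sums would run inside GPU kernels and the combination step would be a library collective; the sketch only shows how the work and the communication are divided.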

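Abstract (Chinese version):

The phonon Boltzmann transport equation (BTE) can effectively simulate heat conduction at the mesoscopic scale. Compared with stochastic methods, deterministic methods, represented by the finite volume method, are considered more promising for solving practical engineering problems with the phonon BTE. However, solving the BTE with the finite volume method requires many iteration steps and a long iteration time. To address this, a GPU parallel acceleration scheme is proposed for the iterative solution part of the phonon BTE, with a suitable thread allocation scheme and data storage format, and optimizations such as loop unrolling and kernel fusion are used to accelerate the iteration process. In addition, a multi-GPU parallel version of the phonon BTE solver is implemented with a parallel strategy based on angular directions, using MPI+CUDA, CUDA-Aware MPI, and NCCL functions. Experimental results show that, compared with the serial version on an Intel Xeon Gold 6248, a speedup of up to 31.5 times is obtained on a single V100 GPU. The multi-GPU version using NCCL functions reaches a parallel efficiency of up to 83% on 8 DGX-2 nodes with 128 V100 GPUs in total, which is 57% higher than the MPI+CUDA version.

Key words (Chinese version): parallel acceleration, Boltzmann transport equation (BTE), DGX-2, compute unified device architecture (CUDA)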

WEN Minhua, LIU Yongzhi, BAO Hua, HU Yue, SHEN Yongxing, WEI Jianwen, LIN Xinhua. Parallelization and Optimization of Application for Phonon BTE[J]. Journal of Frontiers of Computer Science and Technology, 2020, 14(8): 1288-1297.

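WEN Minhua, LIU Yongzhi, BAO Hua, HU Yue, SHEN Yongxing, WEI Jianwen, LIN Xinhua. Research on parallelization and optimization of a phonon BTE application[J]. 计算机科学与探索 (Journal of Frontiers of Computer Science and Technology), 2020, 14(8): 1288-1297.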
