Journal of Frontiers of Computer Science and Technology ›› 2018, Vol. 12 ›› Issue (4): 550-558. DOI: 10.3778/j.issn.1673-9418.1611092

• High Performance Computing •

Research on Parallelization and Acceleration of Laser-Plasma-Interaction Simulation

WU Haipeng1, WEN Minhua1+, SEE Simon2, LIN Xinhua1,3

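  1. Center for High Performance Computing, Shanghai Jiao Tong University, Shanghai 200240
  2. NVIDIA Technology Center, Singapore
  3. Global Scientific Information and Computing Center, Tokyo Institute of Technology, Tokyo, Japan
  • Publication date: 2018-04-01  Release date: 2018-04-04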

Parallelization and Optimization of Laser-Plasma-Interaction Simulation

WU Haipeng1, WEN Minhua1+, SEE Simon2, LIN James1,3   

  1. Center for High Performance Computing, Shanghai Jiao Tong University, Shanghai 200240, China
  2. NVIDIA Technology Center, Singapore
  3. Global Scientific Information and Computing Center, Tokyo Institute of Technology, Tokyo, Japan
  • Online: 2018-04-01  Published: 2018-04-04

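Abstract (Chinese version, translated): As techniques for generating ultra-short laser pulses continue to develop, kinetic descriptions of the interaction between such pulses and plasmas are becoming increasingly important. PIC (particle-in-cell) is a widely adopted method in plasma physics for studying the trajectories of charged particles in electromagnetic fields. Although several GPU implementations of the PIC method already exist, the characteristics of laser-plasma-interaction (LPI) simulation leave many important issues open to alternative approaches. This paper proposes a feasible method for porting an original CPU-based LPI simulation code to the GPU in its entirety. It then proposes a series of methods for accelerating the initial GPU version: a dynamic duplication algorithm, a mixed-precision algorithm, and a particle sorting algorithm. It also uses and evaluates the GPUDirect RDMA (remote direct memory access) technique, which can improve MPI communication performance. Experimental results show that, compared with the initial GPU version, the “Scatter” phase achieves a 6.1x speed-up, and the communication process is 2.8x faster when the MPI message size exceeds 3 KB. These results demonstrate that optimizations tailored to the characteristics of the simulation application and of GPU clusters can bring significant performance gains.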

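Key words (Chinese version, translated): laser-plasma interaction, particle-in-cell simulation, compute unified device architecture (CUDA), CUDA optimization, GPUDirect RDMA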

Abstract: Progress in generating intense ultra-short laser pulses has increased the demand for kinetic descriptions of the interaction between such pulses and plasmas. The particle-in-cell (PIC) algorithm is a widely used method in plasma physics for studying the trajectories of charged particles in electromagnetic fields. Although several GPU implementations of the PIC algorithm exist, some important issues specific to laser-plasma-interaction (LPI) simulation still need to be clarified in detail. This paper presents a way to port the whole algorithm of an original CPU LPI code to a parameterized, adaptive GPU implementation. It then develops a series of methods to speed up the particle scatter phase: a dynamic duplication algorithm, mixed-precision computing, and a parameterized particle sorting algorithm. Furthermore, it uses the GPUDirect RDMA (remote direct memory access) technique on a Kepler cluster and evaluates how it benefits MPI communication performance. Numerical experiments show that, for the key “Scatter” phase, these optimizations yield a 6.1x speed-up over the initial GPU version using the same number of GPUs. The MPI communication part is 2.8x faster when the message size exceeds 3 KB. These findings demonstrate that optimizations tailored to the features of the simulation and of modern GPU clusters are essential for achieving significantly improved performance.

Key words: laser-plasma-interaction simulation, particle-in-cell (PIC), compute unified device architecture (CUDA), CUDA optimization, GPUDirect RDMA
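
For readers unfamiliar with the “Scatter” phase named in the abstract, the following is a minimal CPU-side sketch of the general idea, not the paper's implementation. It assumes a periodic 1D grid and linear (cloud-in-cell) weighting; the function names `scatter_charge` and `sort_by_cell` are hypothetical. In a scatter step, each particle adds its charge to nearby grid nodes, so on a GPU many threads may update the same node at once. Sorting particles by cell index makes particles that touch the same nodes contiguous in memory, which is one common motivation for a particle sorting step.

```python
import math


def scatter_charge(positions, charges, n_cells, dx):
    """Deposit particle charge onto a periodic 1D grid with linear weights."""
    rho = [0.0] * n_cells
    for x, q in zip(positions, charges):
        s = x / dx
        i = math.floor(s)   # index of the grid node to the left of the particle
        w = s - i           # fractional distance from that node, in [0, 1)
        # Split the charge between the two neighbouring nodes.
        rho[i % n_cells] += q * (1.0 - w) / dx
        rho[(i + 1) % n_cells] += q * w / dx
    return rho


def sort_by_cell(positions, charges, dx):
    """Reorder particles so that particles in the same cell are contiguous."""
    order = sorted(range(len(positions)),
                   key=lambda p: math.floor(positions[p] / dx))
    return [positions[p] for p in order], [charges[p] for p in order]


# Illustrative input values (assumed, not taken from the paper).
pos = [0.25, 2.5, 0.75, 3.9]
q = [1.0, 1.0, 1.0, 1.0]
rho_unsorted = scatter_charge(pos, q, n_cells=4, dx=1.0)
sorted_pos, sorted_q = sort_by_cell(pos, q, dx=1.0)
rho_sorted = scatter_charge(sorted_pos, sorted_q, n_cells=4, dx=1.0)
```

Sorting changes only the order in which particles are visited, so the deposited density is the same up to floating-point rounding, and the total deposited charge equals the total particle charge.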