Journal of Frontiers of Computer Science and Technology ›› 2014, Vol. 8 ›› Issue (7): 769-777. DOI: 10.3778/j.issn.1673-9418.1312032

• High Performance Computing •

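Ultra-Mat: Plane-Wave-Based First-Principles Heterogeneous Computing Software

JIA Weile1,2,3, CAO Zongyan1+, WANG Long1, CHI Xuebin1, GAO Weiguo4, WANG Linwang5

  1. Computer Network Information Center, Chinese Academy of Sciences, Beijing 100190
  2. University of Chinese Academy of Sciences, Beijing 100190
  3. Beijing Beilong Super Cloud Computing Co., Ltd., Beijing 100190
  4. Department of Mathematics, Fudan University, Shanghai 200433
  5. Lawrence Berkeley National Laboratory, USA
  • Publication date: 2014-07-01  Release date: 2014-07-02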

Ultra-Mat: A Heterogeneous First Principle Calculation Software Based on Plane Wave

JIA Weile1,2,3, CAO Zongyan1+, WANG Long1, CHI Xuebin1, GAO Weiguo4, WANG Linwang5   

  1. Computer Network Information Center, Chinese Academy of Sciences, Beijing 100190, China
    2. University of Chinese Academy of Sciences, Beijing 100190, China
    3. Beijing Beilong Chao Yuan Company, Beijing 100190, China
    4. Department of Mathematics, Fudan University, Shanghai 200433, China
    5. Lawrence Berkeley National Laboratory, USA
  • Online: 2014-07-01  Published: 2014-07-02

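Abstract: Plane-wave-based first-principles calculation is currently the most widely used method in materials science, but traditional CPU parallel computing has hit a scalability bottleneck and can no longer improve the absolute time to solution. This paper systematically presents Ultra-Mat, a large-scale first-principles materials calculation software package developed with graphics processing unit (GPU) acceleration. The software carries out a systematic algorithm design and software implementation of the first-principles plane-wave algorithm: (1) a parallel scheme is adopted so that fast Fourier transforms (FFT) are performed as local operations on the GPU; (2) a mixed-precision algorithm based on data compression is designed, which significantly reduces MPI (message passing interface) communication in the electronic structure calculation; (3) more than 90% of the code is implemented on the GPU, with the aim of minimizing intermediate steps and thus avoiding the data transfers caused by switching between CPU and GPU, a widely recognized performance bottleneck in GPU applications. Test results show that Ultra-Mat delivers very good performance: for a 512-atom GaAs system, 256 GPU cards achieve an 18-fold speedup over 4096 CPU cores in the electronic structure calculation.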

Key words: graphics processing unit (GPU), first principles, plane-wave pseudopotential density functional theory

Abstract: First-principles calculation based on plane waves is the most popular method in materials science simulation. However, traditional CPU parallelization has hit a scalability bottleneck, so the absolute computing time can no longer be reduced by adding CPU cores. This paper presents Ultra-Mat, a first-principles calculation software package for large-scale GPU (graphics processing unit) clusters. The plane-wave algorithm is redesigned and reimplemented in three respects: (1) a hybrid parallelization scheme lets each FFT (fast Fourier transform) be carried out within a single GPU card; (2) a mixed-precision algorithm is designed and implemented to reduce CPU-GPU memory copies and MPI (message passing interface) communication; (3) more than 90% of the code is implemented in CUDA, which further reduces CPU-GPU memory copies, a recognized bottleneck on heterogeneous supercomputers. For a 512-atom GaAs system, test results show that 256 GPU cards achieve an 18-fold speedup in the electronic structure calculation over 4096 CPU cores.

Key words: graphics processing unit (GPU), first-principles calculation, plane-wave pseudopotential density functional theory
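
The mixed-precision idea in item (2) of the abstract can be illustrated with a minimal sketch. It assumes that the compression step amounts to casting double-precision values to single precision before they are exchanged and promoting them back afterwards; the abstract does not say this is how Ultra-Mat does it. The function names `compress` and `decompress` and the sample data are hypothetical, and a real code would hand the packed buffer to MPI instead of keeping it in memory.

```python
from array import array


def compress(values):
    """Cast double-precision values to single precision (assumed compression step)."""
    return array("f", values)


def decompress(buf):
    """Promote a single-precision buffer back to double precision."""
    return array("d", buf)


# Hypothetical data standing in for the coefficients that would be communicated.
coeffs = array("d", [0.1 * i for i in range(1000)])

packed = compress(coeffs)      # this buffer is what would be sent between processes
restored = decompress(packed)  # the receiver continues in double precision

bytes_double = len(coeffs) * coeffs.itemsize
bytes_single = len(packed) * packed.itemsize

# Largest relative error introduced by the round trip through single precision.
max_rel_err = max(
    abs(a - b) / abs(a) for a, b in zip(coeffs, restored) if a != 0.0
)

print("bytes in double precision:", bytes_double)
print("bytes in single precision:", bytes_single)
print("max relative round-trip error:", max_rel_err)
```

On platforms where a C `float` is 4 bytes and a `double` is 8 bytes, the packed message is half the size, and the cost is a relative rounding error on the order of single-precision machine epsilon. Whether that error is acceptable depends on where in the algorithm the reduced-precision data is used.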