Journal of Frontiers of Computer Science and Technology ›› 2019, Vol. 13 ›› Issue (10): 1664-1676. DOI: 10.3778/j.issn.1673-9418.1811029

• High Performance Computing •

LQCD Dslash在神威·太湖之光上的研究分析与MPI实现

张淼,周宇,陈建海,何钦铭,徐顺,宫明   

  1. 浙江大学 计算机科学与技术学院,杭州 310012
  2. 中国科学院 计算机网络信息中心,北京 100190
  3. 中国科学院 高能物理研究所,北京 100049
  • 出版日期:2019-10-01 发布日期:2019-10-15

Analysis and MPI Implementation of LQCD Dslash on Sunway TaihuLight

ZHANG Miao, ZHOU Yu, CHEN Jianhai, HE Qinming, XU Shun, GONG Ming   

  1. College of Computer Science and Technology, Zhejiang University, Hangzhou 310012, China
  2. Computer Network Information Center, Chinese Academy of Sciences, Beijing 100190, China
  3. Institute of High Energy Physics, Chinese Academy of Sciences, Beijing 100049, China
  • Online: 2019-10-01 Published: 2019-10-15

摘要: “神威·太湖之光”是我国全自主研发的千万核超级计算机,目前已有很多大型应用程序在此先进架构上进行了移植优化。然而,高能物理领域的格点量子色动力学(LQCD)数值模拟软件在神威平台上尚未进行过移植优化,这引起了科学工作者们的关注。针对LQCD在神威平台上的移植优化问题展开研究。首先,论述了国内外对LQCD在不同硬件架构上进行并行优化的发展历程。其次,通过对其热点模块Dslash的重构,实现了在神威平台上的成功移植。再次,针对申威26010芯片异构众核的架构和并行模式,实现了从核阵列异构并行、从核本地设备存储器(LDM)与主存之间的直接存储访问(DMA)通讯、主核之间的消息传递接口(MPI)通讯及全局归约等操作。最后,经过实验测试,单核组优化程序与16核组优化程序相比单主核程序分别获得了165倍和25倍的加速比,并发现了一些重要的性能瓶颈问题,为进一步优化提升整体效率奠定重要基础。同时,对国产超算平台的推广使用具有积极意义。

关键词: 格点量子色动力学(LQCD), Dslash, 消息传递接口(MPI), 神威·太湖之光, 众核芯片

Abstract: Sunway TaihuLight is a supercomputer with more than ten million cores, developed entirely independently by China. Many large-scale applications have already been ported to and optimized for this architecture. However, lattice quantum chromodynamics (LQCD) simulation software from high energy physics had not yet been ported to or optimized for the Sunway platform, which has drawn researchers' attention. This paper studies the porting and optimization of LQCD on the Sunway platform. First, it reviews the development, in China and abroad, of parallel optimization of LQCD on different hardware architectures. Second, by refactoring Dslash, the hotspot module of LQCD, it ports the code successfully to the Sunway platform. Third, following the heterogeneous many-core architecture and parallel model of the SW26010 processor, it implements heterogeneous parallelism on the computing processing element (CPE) cluster, direct memory access (DMA) communication between CPE local device memory (LDM) and main memory, message passing interface (MPI) communication between management processing elements (MPEs), and global reduction. Finally, in experiments, the optimized single core group (CG) version and the optimized 16-CG version achieve speedups of 165 times and 25 times, respectively, over the single-MPE version. The experiments also reveal several important performance bottlenecks, laying a foundation for further optimization of overall performance. The work also has positive significance for promoting the use of domestic supercomputing platforms.

Key words: lattice quantum chromodynamics (LQCD), Dslash, message passing interface (MPI), Sunway TaihuLight, many-core processor
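
The communication pattern named in the abstract (boundary exchange between MPEs followed by a global reduction) can be illustrated with a toy model. The sketch below is not the authors' code: it replaces Dslash with a simple nearest-neighbour operator on a 1D periodic lattice, simulates the ranks inside one Python process, and uses hypothetical function names throughout. A real Dslash acts on a 4D lattice with gauge links and spinor fields, and a real implementation would use MPI point-to-point calls and a reduction such as `MPI_Allreduce` in place of the stand-ins here.

```python
# Toy illustration only: a 1D periodic lattice with a nearest-neighbour
# operator standing in for Dslash. Ranks are simulated in-process; all
# names are hypothetical and chosen for this sketch.

def hop(field):
    """Reference result: out[i] = field[i-1] + field[i+1], periodic."""
    n = len(field)
    return [field[(i - 1) % n] + field[(i + 1) % n] for i in range(n)]

def split(field, n_ranks):
    """Split the lattice into equal contiguous sub-lattices, one per rank."""
    assert len(field) % n_ranks == 0, "lattice must divide evenly among ranks"
    size = len(field) // n_ranks
    return [field[r * size:(r + 1) * size] for r in range(n_ranks)]

def halo_exchange(subs):
    """Return (left_halo, right_halo) for each rank.

    Stands in for the boundary exchange that MPI would perform.
    """
    n = len(subs)
    return [(subs[(r - 1) % n][-1], subs[(r + 1) % n][0]) for r in range(n)]

def local_hop(sub, left, right):
    """Apply the operator on one rank using its two halo sites."""
    padded = [left] + sub + [right]
    return [padded[i - 1] + padded[i + 1] for i in range(1, len(padded) - 1)]

def global_sum(partials):
    """Stand-in for a global sum reduction across ranks."""
    return sum(partials)

def distributed_hop_norm(field, n_ranks):
    """Apply the operator rank by rank and reduce the squared norm."""
    subs = split(field, n_ranks)
    halos = halo_exchange(subs)
    outs = [local_hop(s, left, right) for s, (left, right) in zip(subs, halos)]
    partials = [sum(x * x for x in out) for out in outs]
    gathered = [x for out in outs for x in out]
    return gathered, global_sum(partials)
```

Calling `distributed_hop_norm(list(range(16)), 4)` should return the same field as `hop(list(range(16)))`, together with its squared norm, which is the consistency check a domain-decomposed operator has to pass.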