Journal of Frontiers of Computer Science and Technology (计算机科学与探索) ›› 2019, Vol. 13 ›› Issue (10): 1654-1663. DOI: 10.3778/j.issn.1673-9418.1811030

• High Performance Computing •

申威众核处理器上的三对角并行求解器

刘侃,王欣亮,许平,薛巍   

  1. 清华大学 计算机科学与技术系,北京 100086
  2. 国家超级计算无锡中心,江苏 无锡 214100
  • 出版日期:2019-10-01 发布日期:2019-10-15

Parallel Tridiagonal Solver on Sunway Many-Core Processors

LIU Kan, WANG Xinliang, XU Ping, XUE Wei   

  1. Department of Computer Science and Technology, Tsinghua University, Beijing 100086, China
  2. National Supercomputing Center in Wuxi, Wuxi, Jiangsu 214100, China
  • Online: 2019-10-01  Published: 2019-10-15

摘要: 三对角方程求解器是一种在很多科学与工程领域广泛应用的数值计算核心。目前,CPU、GPU等主流硬件平台上都提出了高度优化的并行算法,但是对于中国自主研发的申威26010众核处理器,还没有一种算法能有效地利用其独特的硬件特性来达到最大化的性能。提出了一种分布式CR算法swDCR,来求解大量的、规模不大的三对角方程。该算法对每个三对角方程使用多个从核并行求解,通过联合多个从核的缓存使得运算过程中所有中间变量都能存储在缓存中,同时利用寄存器通信完成核间数据的高速传输。通过设计线程级数据划分机制,使得向量化的优化效果最大化。swDCR的吞吐率相比主核上的追赶法达到了单精度43.9倍和双精度36.7倍的加速,相比从核上的追赶法达到了单精度和双精度均2.07倍的加速。该算法在申威26010处理器单个核组上可以获得24 GB/s的有效带宽。

关键词: 三对角, 申威众核处理器, 循环消去(CR)算法

Abstract: The tridiagonal solver is an important numerical kernel that is widely used in scientific and engineering applications. Many highly optimized parallel algorithms have been proposed for mainstream hardware platforms such as CPUs and GPUs. However, for the Chinese domestically developed Sunway 26010 many-core processor, no existing algorithm exploits its unique hardware characteristics to maximize performance. This paper proposes a Sunway-oriented distributed cyclic reduction algorithm (swDCR) for solving a large number of small tridiagonal systems. swDCR uses multiple computation processing elements (CPEs) to solve each system in parallel, combines the caches of those CPEs so that all intermediate data stay in cache during the computation, and transfers data among CPEs through register communication. A thread-level data partitioning scheme is designed to maximize the benefit of vectorization. In throughput, swDCR achieves a speedup of 43.9 times in single precision and 36.7 times in double precision over the Thomas algorithm on the management processing element (MPE), and 2.07 times in both single and double precision over the Thomas algorithm on the CPEs. It achieves an effective bandwidth of 24 GB/s on one core group of the Sunway 26010 processor.

Key words: tridiagonal, Sunway many-core processor, cyclic reduction (CR) algorithm
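
For readers unfamiliar with the two algorithms named in the abstract, the following is a minimal serial sketch in Python of the textbook Thomas algorithm and of textbook cyclic reduction (CR). It is not swDCR: it models no CPEs, caches, register communication, or vectorization, and the function names `thomas` and `cyclic_reduction` are hypothetical, chosen only for illustration. The assumed convention is that row i reads a[i]*x[i-1] + b[i]*x[i] + c[i]*x[i+1] = d[i], with a[0] and c[n-1] ignored, and that the matrix is diagonally dominant so no pivoting is needed.

```python
def thomas(a, b, c, d):
    """Textbook Thomas algorithm: forward elimination, then back substitution."""
    n = len(b)
    cp = [0.0] * n  # modified super-diagonal
    dp = [0.0] * n  # modified right-hand side
    cp[0] = c[0] / b[0]
    dp[0] = d[0] / b[0]
    for i in range(1, n):
        denom = b[i] - a[i] * cp[i - 1]
        cp[i] = c[i] / denom
        dp[i] = (d[i] - a[i] * dp[i - 1]) / denom
    x = [0.0] * n
    x[-1] = dp[-1]
    for i in range(n - 2, -1, -1):
        x[i] = dp[i] - cp[i] * x[i + 1]
    return x


def cyclic_reduction(a, b, c, d):
    """Textbook cyclic reduction, written recursively for any n >= 1.

    Each level eliminates the even-indexed unknowns, leaving a tridiagonal
    system of about half the size over the odd-indexed unknowns.
    """
    n = len(b)
    if n == 1:
        return [d[0] / b[0]]

    ra, rb, rc, rd = [], [], [], []
    # Forward reduction: iterations of this loop are independent of each other.
    for i in range(1, n, 2):
        alpha = a[i] / b[i - 1]          # removes x[i-1] using row i-1
        na = -alpha * a[i - 1]
        nb = b[i] - alpha * c[i - 1]
        nd = d[i] - alpha * d[i - 1]
        nc = 0.0
        if i + 1 < n:
            gamma = c[i] / b[i + 1]      # removes x[i+1] using row i+1
            nb -= gamma * a[i + 1]
            nd -= gamma * d[i + 1]
            nc = -gamma * c[i + 1]
        ra.append(na)
        rb.append(nb)
        rc.append(nc)
        rd.append(nd)

    x = [0.0] * n
    x[1::2] = cyclic_reduction(ra, rb, rc, rd)

    # Back substitution: recover the even-indexed unknowns, again independently.
    for i in range(0, n, 2):
        left = a[i] * x[i - 1] if i > 0 else 0.0
        right = c[i] * x[i + 1] if i + 1 < n else 0.0
        x[i] = (d[i] - left - right) / b[i]
    return x


if __name__ == "__main__":
    n = 8
    a = [0.0] + [-1.0] * (n - 1)
    b = [4.0] * n
    c = [-1.0] * (n - 1) + [0.0]
    d = [1.0] * n
    xt = thomas(a, b, c, d)
    xc = cyclic_reduction(a, b, c, d)
    print(max(abs(u - v) for u, v in zip(xt, xc)))
```

The Thomas loops carry a dependency from each row to the next, whereas the iterations within one CR level do not depend on one another. That independence is the general reason CR-style methods are used when several cores cooperate on a single system, at the cost of more arithmetic than the Thomas algorithm.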