[1] BARRETT R, BERRY M, CHAN T, et al. Templates for the solution of linear systems: building block for iterative methods[M]. Philadelphia: SIAM, 1994.
[2] BJÖRCK Å. Numerical methods in matrix computations[M]. Cham: Springer, 2015.
[3] BAI Z J, DEMMEL J, DONGARRA J, et al. Templates for the solution of algebraic eigenvalue problems: a practical guide[M]. Philadelphia: SIAM, 2000.
[4] SAAD Y. Numerical methods for large eigenvalue problems: revised edition[M]. Philadelphia: SIAM, 2011.
[5] SAAD Y. Iterative methods for sparse linear systems[M]. Philadelphia: SIAM, 2003.
[6] ANDERSON E, BAI Z, BISCHOF C, et al. LAPACK users’ guide[M]. Philadelphia: SIAM, 1992.
[7] BLACKFORD L S, CHOI J, CLEARY A, et al. ScaLAPACK users’ guide[M]. Philadelphia: SIAM, 1997.
[8] BUTTARI A, LANGOU J, KURZAK J, et al. A class of parallel tiled linear algebra algorithms for multicore architectures[J]. Parallel Computing, 2009, 35: 38-53.
[9] DONGARRA J, GATES M, HAIDAR A, et al. PLASMA: parallel linear algebra software for multicore using OpenMP[J]. ACM Transactions on Mathematical Software, 2019, 45(2): 1-35.
[10] BOSILCA G, BOUTEILLER A, DANALIS A, et al. Scalable dense linear algebra on heterogeneous hardware[J]. Advances in Parallel Computing, 2013, 28: 65-103.
[11] GATES M, KURZAK J, CHARARA A, et al. SLATE: design of a modern distributed and accelerated linear algebra library[C]//Proceedings of the 2019 International Conference for High Performance Computing, Networking, Storage and Analysis, Denver, Nov 17-19, 2019. New York: ACM, 2019: 1-18.
[12] LIU F F, MA W J, ZHAO Y W, et al. xMath2.0: a high-performance extended math library for SW26010-Pro many-core processor[J]. CCF Transactions on High Performance Computing, 2023, 5: 56-71.
[13] BALAY S, ABHYANKAR S, ADAMS M, et al. PETSc users manual (revision 3.15)[R]. Argonne: Argonne National Laboratory, 2021.
[14] HEROUX M A, BARTLETT R A, HOWLE V E, et al. An overview of the Trilinos project[J]. ACM Transactions on Mathematical Software, 2005, 31(3): 397-423.
[15] FALGOUT R D, JONES J E, YANG U M. The design and implementation of hypre, a library of parallel high performance preconditioners[M]//BRUASET A M, TVEITO A. Numerical Solution of Partial Differential Equations on Parallel Computers. Berlin, Heidelberg: Springer, 2006: 267-294.
[16] ANZT H, CHEN Y C, COJEAN T, et al. Towards continuous benchmarking: an automated performance evaluation framework for high performance software[C]//Proceedings of the Platform for Advanced Scientific Computing Conference, Zurich, Jun 12-14, 2019. New York: ACM, 2019: 1-11.
[17] LI X S. An overview of SuperLU: algorithms, implementation, and user interface[J]. ACM Transactions on Mathematical Software, 2005, 31(3): 302-325.
[18] GHYSELS P, SYNK R. High performance sparse multifrontal solvers on modern GPUs[J]. Parallel Computing, 2022, 110: 102897.
[19] YAMAMOTO Y. High-performance algorithms for numerical linear algebra[M]//GESHI M. The Art of High Performance Computing for Computational Science. Berlin, Heidelberg: Springer, 2019: 113-136.
[20] BOSILCA G, BOUTEILLER A, DANALIS A, et al. DAGuE: a generic distributed DAG engine for high performance computing[J]. Parallel Computing, 2012, 38(1/2): 37-51.
[21] DEMMEL J, GRIGORI L, HOEMMEN M, et al. Communication-optimal parallel and sequential QR and LU factorizations[J]. SIAM Journal on Scientific Computing, 2012, 34: A206-A239.
[22] TAN L, KOTHAPALLI S, CHEN L, et al. A survey of power and energy efficient techniques for high performance numerical linear algebra operations[J]. Parallel Computing, 2014, 40(10): 559-573.
[23] ABDELFATTAH A, ANZT H, BOMAN E G, et al. A survey of numerical linear algebra methods utilizing mixed precision arithmetic[J]. International Journal of High Performance Computing Applications, 2021, 35(4): 344-369.
[24] HIGHAM N J, MARY T. Mixed precision algorithms in numerical linear algebra[J]. Acta Numerica, 2022, 31: 347-414.
[25] ELLIOTT J, HOEMMEN M, MUELLER F. Evaluating the impact of SDC on the GMRES iterative solver[C]//Proceedings of the 2014 IEEE 28th International Parallel and Distributed Processing Symposium, Phoenix, May 19-23, 2014. Washington: IEEE Computer Society, 2014: 1193-1202.
[26] YARKHAN A, KURZAK J, LUSZCZEK P, et al. Porting the PLASMA numerical library to the OpenMP standard[J]. International Journal of Parallel Programming, 2017, 45: 612-633.
[27] BOSILCA G, BOUTEILLER A, DANALIS A, et al. Flexible development of dense linear algebra algorithms on massively parallel architectures with DPLASMA[C]//Proceedings of the 2011 IEEE International Symposium on Parallel and Distributed Processing Workshops and PhD Forum, Anchorage, May 16-20, 2011. Piscataway: IEEE, 2011: 1432-1441.
[28] AGULLO E, AUMAGE O, FAVERGE M, et al. Achieving high performance on supercomputers with a sequential task-based programming model[J]. IEEE Transactions on Parallel and Distributed Systems, 2017. DOI: 10.1109/TPDS.2017.2766064.
[29] TOMOV S. MAGMA tutorial[R/OL]. (2020-02-03) [2023-03-21]. https://ecpannualmeeting.com/assets/overview/sessions/2020-magma-heffte-tutorial.pdf.
[30] GATES M, YARKHAN A, SUKKARI D, et al. Portable and efficient dense linear algebra in the beginning of the exascale era[C]//Proceedings of the 2022 IEEE/ACM International Workshop on Performance, Portability and Productivity in HPC, Dallas, Nov 13-18, 2022. Piscataway: IEEE, 2022: 36-46.
[31] ANZT H, BOMAN E, FALGOUT R, et al. Preparing sparse solvers for exascale computing[J]. Philosophical Transactions of the Royal Society A, 2020, 378: 20190053.
[32] BAVIER E, HOEMMEN M, RAJAMANICKAM S, et al. Amesos2 and Belos: direct and iterative solvers for large sparse linear systems[J]. Scientific Programming, 2012, 20: 241-255.
[33] EDWARDS H C, TROTT C R, SUNDERLAND D. Kokkos: enabling manycore performance portability through polymorphic memory access patterns[J]. Journal of Parallel and Distributed Computing, 2014, 74(12): 3202-3216.
[34] BOOTH J D, ELLINGWOOD N D, THORNQUIST H K, et al. Basker: parallel sparse LU factorization utilizing hierarchical parallelism and data layouts[J]. Parallel Computing, 2017, 68: 17-31.
[35] KIM K, EDWARDS H C, RAJAMANICKAM S. Tacho: memory-scalable task parallel sparse Cholesky factorization[C]//Proceedings of the 2018 IEEE International Parallel and Distributed Processing Symposium Workshops, Vancouver, May 21-25, 2018. Washington: IEEE Computer Society, 2018: 550-559.
[36] HEROUX M A, MCINNES L, LI S, et al. ECP software technology capability assessment report: ORNL/TM-2022/2651[R]. US Department of Energy Office of Science, Office of Advanced Scientific Computing Research, 2022.
[37] DE STERCK H, FALGOUT R D, NOLTING J W, et al. Distance-two interpolation for parallel algebraic multigrid[J]. Numerical Linear Algebra with Applications, 2008, 15(2/3): 115-139.
[38] VASSILEVSKI P S, YANG U M. Reducing communication in algebraic multigrid using additive variants[J]. Numerical Linear Algebra with Applications, 2014, 21(2): 275-296.
[39] FALGOUT R D, SCHRODER J B. Non-Galerkin coarse grids for algebraic multigrid[J]. SIAM Journal on Scientific Computing, 2014, 36(3): C309-C334.
[40] ALIAGA J I, ANZT H, GRÜTZMACHER T, et al. Compressed basis GMRES on high-performance graphics processing units[J]. The International Journal of High Performance Computing Applications, 2022: 1-18.
[41] FLEGAR G, ANZT H, COJEAN T, et al. Adaptive precision Block-Jacobi for high performance preconditioning in the Ginkgo linear algebra software[J]. ACM Transactions on Mathematical Software, 2021, 47(2): 1-28.
[42] ANZT H, DONGARRA J, FLEGAR G, et al. Adaptive precision in Block-Jacobi preconditioning for iterative sparse linear system solvers[J]. Concurrency and Computation: Practice and Experience, 2019, 31(6): e4460.
[43] ANZT H, RIBIZEL T, FLEGAR G, et al. ParILUT—a parallel threshold ILU for GPUs[C]//Proceedings of the 2019 IEEE International Parallel and Distributed Processing Symposium, Rio de Janeiro, May 20-24, 2019. Piscataway: IEEE, 2019: 231-241.
[44] DONGARRA J, GRIGORI L, HIGHAM N J. Numerical algorithms for high-performance computational science[J]. Philosophical Transactions of the Royal Society A, 2020, 378: 20190066.
[45] ABDELFATTAH A, ANZT H, AYALA A, et al. Advances in mixed precision algorithms: 2021 edition: SAND2021-10227R[R]. Albuquerque: Sandia National Laboratories, 2021.
[46] CHARARA A, DONGARRA J, GATES M, et al. SLATE mixed precision performance report: ICL-UT-19-03[R]. Knoxville: University of Tennessee, 2019.
[47] BARRON D W, SWINNERTON-DYER H P F. Solution of simultaneous linear equations using a magnetic-tape store[J]. The Computer Journal, 1960, 3(1): 28-33.
[48] ALOMAIRY R, GATES M, CAYROLS S, et al. Communication avoiding LU with tournament pivoting in SLATE: SLATE working note 18, ICL-UT-22-01[R]. 2022.
[49] GRIGORI L, DEMMEL J W, XIANG H. Communication avoiding Gaussian elimination[C]//Proceedings of the 2008 ACM/IEEE Conference on Supercomputing, Austin, Nov 15-21, 2008. Piscataway: IEEE, 2008: 29.
[50] GRIGORI L, DEMMEL J W, XIANG H. CALU: a communication optimal LU factorization algorithm[J]. SIAM Journal on Matrix Analysis and Applications, 2011, 32(4): 1317-1350.
[51] SAO P, VUDUC R, LI X. A communication-avoiding 3D algorithm for sparse LU factorization on heterogeneous systems[J]. Journal of Parallel and Distributed Computing, 2019, 131: 218-234.
[52] DING N, WILLIAMS S, LIU Y, et al. Leveraging one-sided communication for sparse triangular solvers[C]//Proceedings of the 2020 SIAM Conference on Parallel Processing for Scientific Computing, Seattle, Feb 12-15, 2020. Philadelphia: SIAM, 2020: 93-105.