TEB: Efficient SpMV Storage Format for Matrix Decomposition and Reconstruction on GPU

doi:10.3778/j.issn.1673-9418.2304039

Abstract

Abstract: Sparse matrix-vector multiplication (SpMV) is a crucial computation in science and engineering. The compressed sparse row (CSR) format is one of the most commonly used storage formats for sparse matrices. When parallel SpMV is implemented on a graphics processing unit (GPU), CSR stores only the non-zero elements of the sparse matrix, which avoids the redundant computation caused by zero padding and saves storage space. However, it suffers from load imbalance, which wastes computing resources. To address this problem, storage formats that have performed well in recent years are studied, and a row-wise decomposition and reorganization storage format, TEB (threshold-exchangeorder block), is proposed. The format first uses a heuristic threshold selection algorithm to determine a suitable splitting threshold, then applies a reordering-based row merging algorithm to reconstruct and decompose the sparse matrix, so that the numbers of non-zero elements in the blocks are as close to each other as possible. Furthermore, using CUDA (compute unified device architecture) threads, a parallel SpMV algorithm across sub-blocks based on the TEB storage format is proposed, which allocates computing resources reasonably and solves the load imbalance problem, thereby improving the parallel efficiency of SpMV. To verify the effectiveness of the TEB storage format, experiments are conducted on the NVIDIA Tesla V100 platform. The results show that, compared with the PBC (partition-block-CSR), AMF-CSR (adaptive multi-row folding of CSR), CSR-Scalar (compressed sparse row-scalar), and CSR5 (compressed sparse row 5) storage formats, TEB improves SpMV time performance by 3.23×, 5.83×, 2.33×, and 2.21× on average, respectively, and improves floating-point computing performance by 3.36×, 5.95×, 2.29×, and 2.13× on average, respectively.

Key words: sparse matrix-vector multiplication (SpMV), reordering, compressed sparse row (CSR) format, load balancing, storage format, graphics processing unit (GPU)
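
The abstract does not spell out the TEB algorithms, so the following Python sketch only illustrates the general idea it describes: splitting CSR rows at a threshold and regrouping the pieces into blocks with similar non-zero counts. Every name here (`split_rows`, `pack_blocks`, `blocked_spmv`), the fixed threshold, and the greedy longest-first packing are assumptions made for illustration. They are not the authors' heuristic threshold selection or row merging algorithm, and the loops run sequentially on the CPU rather than as CUDA threads.

```python
def csr_spmv(row_ptr, col_idx, vals, x):
    """Reference y = A @ x for a CSR matrix, processed one row at a time."""
    y = [0.0] * (len(row_ptr) - 1)
    for i in range(len(y)):
        for k in range(row_ptr[i], row_ptr[i + 1]):
            y[i] += vals[k] * x[col_idx[k]]
    return y


def split_rows(row_ptr, threshold):
    """Cut each row into segments of at most `threshold` non-zeros.

    Returns (row, start, end) triples that index into col_idx / vals.
    """
    segs = []
    for i in range(len(row_ptr) - 1):
        for s in range(row_ptr[i], row_ptr[i + 1], threshold):
            segs.append((i, s, min(s + threshold, row_ptr[i + 1])))
    return segs


def pack_blocks(segs, num_blocks):
    """Greedy packing (an assumed stand-in for row merging).

    Segments are reordered longest first, and each one is placed into the
    block that currently holds the fewest non-zeros.
    """
    blocks = [[] for _ in range(num_blocks)]
    loads = [0] * num_blocks
    for seg in sorted(segs, key=lambda s: s[2] - s[1], reverse=True):
        b = loads.index(min(loads))
        blocks[b].append(seg)
        loads[b] += seg[2] - seg[1]
    return blocks, loads


def blocked_spmv(blocks, col_idx, vals, x, n_rows):
    """SpMV over the regrouped blocks.

    On a GPU each block would map to a group of threads, and the
    accumulation into y[row] would need an atomic add or a reduction.
    """
    y = [0.0] * n_rows
    for block in blocks:
        for row, s, e in block:
            y[row] += sum(vals[k] * x[col_idx[k]] for k in range(s, e))
    return y


# A 4 x 6 example whose rows hold 6, 1, 1 and 2 non-zeros.
row_ptr = [0, 6, 7, 8, 10]
col_idx = [0, 1, 2, 3, 4, 5, 1, 2, 0, 5]
vals = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]
x = [1.0] * 6

segs = split_rows(row_ptr, threshold=3)
blocks, loads = pack_blocks(segs, num_blocks=2)
y_blocked = blocked_spmv(blocks, col_idx, vals, x, n_rows=4)
```

With one worker per row, the per-worker loads in this example are 6, 1, 1, and 2 non-zeros. After splitting at a threshold of 3 and packing into two blocks, each block holds 5 non-zeros, and the blocked product matches the row-by-row CSR result.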

WANG Yuhua, ZHANG Yuqi, HE Junfei, XU Yuezhu, CUI Huanyu. TEB: Efficient SpMV Storage Format for Matrix Decomposition and Reconstruction on GPU[J]. Journal of Frontiers of Computer Science and Technology, 2024, 18(4): 1094-1108.
