计算机科学与探索 ›› 2020, Vol. 14 ›› Issue (9): 1501-1509.DOI: 10.3778/j.issn.1673-9418.1909070

• 学术研究 • 上一篇    下一篇

面向申威众核处理器的LZMA并行算法设计与优化

李秉政,黄高阳,许瑾晨   

  1. 1. 数学工程与先进计算国家重点实验室,郑州 450000
    2. 江南计算技术研究所,江苏 无锡 214125
  • 出版日期:2020-09-01 发布日期:2020-09-07

Design and Optimization of Parallel LZMA for Many-Core Sunway Processor

LI Bingzheng, HUANG Gaoyang, XU Jinchen   

  1. 1. State Key Laboratory of Mathematical Engineering and Advanced Computing, Zhengzhou 450000, China
    2. Jiangnan Institute of Computing Technology, Wuxi, Jiangsu 214125, China
  • Online:2020-09-01 Published:2020-09-07

摘要:

随着高性能计算和科学计算应用的发展,高性能计算集群系统传输、存储和处理的数据规模呈现爆炸式增长。对大规模数据进行高效的压缩,减少数据存储所需空间和传输所需的通信带宽,是提升高性能计算集群系统性能的关键之一。无损压缩算法中,LZMA算法具有较高的压缩率,但串行版本的LZMA算法压缩速率很慢。采用多核架构的处理器对无损压缩算法进行并行化,是提升压缩速率的一个研究方向。设计并实现了面向申威26010异构众核处理器并行化LZMA算法。结合申威异构众核处理器的特点,对LZMA算法存储空间需求、访存特性、热点函数等进行分析,基于Athread接口实现LZMA算法从核多线程并行,并对LDM地址空间进行细粒度的布局与优化以获得更好的缓存性能,实现DMA双缓冲的循环滑动窗口算法。测试结果表明,相较主核串行版本算法,并行LZMA算法在Silesia语料库基准测试集和大规模数据集中分别获得了4.1倍和5.3倍的最大加速比,获得了较好的加速效果。

关键词: 并行计算, 异构众核处理器, LZMA, 压缩算法

Abstract:

In recent years, the development of high-performance computing and scientific computing applications results in a huge explosion of data transmitted, stored, and processed by high-performance computing cluster systems. Under this circumstance, efficient compression of large-scale data is needed to improve the performance of high-performance computing cluster systems, which will reduce not only the space required for data storage, but also the communication bandwidth required for transmission. In lossless compression algorithms, LZMA (Lempel Ziv-Markov chain algorithm) has the high compression ratio, but the compression rate of LZMA algorithm in serial version is very slow. Lots of studies use parallel computing to promote the performance of lossless compression algorithms, taking advantage of multi-core architectures. This paper proposes a parallel design and optimization of LZMA based on the Sunway 26010 heterogeneous many-core processor. Combining with Sunway heterogeneous many-core processor’s features, several key factors affecting the performance of LZMA are analyzed, such as space requirements, memory access features, hotspot functions, etc. Based on the Athread interface, the sliding window algorithm of LZMA is reconstructed for the multi-thread parallel. LDM address space is fine-grained and optimized to achieve a better cache performance. Cyclic sliding window algorithm is also achieved using DMA double buffer. The test results show that using the Silesia Corpus benchmark, the final optimized LZMA algorithm achieves a maximum speedup of 4.1 times over the serial baseline implementation of the controller core, while on the big data benchmark speedup is 5.3 times.

Key words: parallel computing, heterogeneous many-core processors, Lempel Ziv-Markov chain algorithm (LZMA), compression algorithm