计算机科学与探索 ›› 2022, Vol. 16 ›› Issue (6): 1304-1315.DOI: 10.3778/j.issn.1673-9418.2011092

• 高性能计算 • 上一篇    下一篇

Cooley-Tukey FFT算法高性能实现与优化研究

郭金鑫1,2, 张广婷2,+(), 张云泉2, 陈泽华1, 贾海鹏2   

  1. 1. 太原理工大学 大数据学院,太原 0300242. 中国科学院 计算技术研究所 计算机体系结构国家重点实验室,北京 100190
  • 收稿日期:2020-11-27 修回日期:2021-02-01 出版日期:2022-06-01 发布日期:2021-03-03
  • 通讯作者: + E-mail: theking140@163.com
  • 作者简介:郭金鑫(1994—),男,山西阳泉人,硕士研究生,CCF学生会员,主要研究方向为高性能计算、并行编程等。
    张广婷(1987—),女,山东泰安人,硕士,工程师,CCF会员,主要研究方向为并行算法、大数据等。
    张云泉(1973—),男,山东聊城人,博士,研究员,博士生导师,CCF高级会员,主要研究方向为高性能并行计算,尤其是大规模并行计算及编程模型,高性能并行数值算法,并行程序的性能建模和评估等。
    陈泽华(1974—),女,山西神池人,教授,CCF高级会员,主要研究方向为粒计算与知识发现、工业大数据等。
    贾海鹏(1983—),男,山东潍坊人,博士,高级工程师,主要研究方向为异构计算、多核并行编程方法、多核上的计算机视觉算法等。
  • 基金资助:
    国家重点研发计划(2017YFB0202105);国家重点研发计划(2016YFB0200803);国家重点研发计划(2017YFB0202302);国家自然科学基金(61972376);北京市自然科学基金(L182053)

High-Performance Implementation and Optimization of Cooley-Tukey FFT Algorithm

GUO Jinxin1,2, ZHANG Guangting2,+(), ZHANG Yunquan2, CHEN Zehua1, JIA Haipeng2   

  1. 1. College of Data Science, Taiyuan University of Technology, Taiyuan 030024, China
    2. State Key Laboratory of Computer Architecture, Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190, China
  • Received:2020-11-27 Revised:2021-02-01 Online:2022-06-01 Published:2021-03-03
  • About author:GUO Jinxin, born in 1994, M.S. candidate, student member of CCF. His research interests include high-performance computing, parallel programming, etc.
    ZHANG Guangting, born in 1987, M.S., engineer, member of CCF. Her research interests include parallel algorithms, big data, etc.
    ZHANG Yunquan, born in 1973, Ph.D., professor, Ph.D. supervisor, senior member of CCF. His research interests include high performance parallel computing, with particular emphasis on large scale parallel computation and programming models, high-performance parallel numerical algorithms, performance modeling and evaluation for parallel programs, etc.
    CHEN Zehua, born in 1974, professor, senior member of CCF. Her research interests include granular computing and knowledge discovery, industrial big data, etc.
    JIA Haipeng, born in 1983, Ph.D., senior engineer. His research interests include heterogeneous computing, many-core parallel programming method, computer vision algorithms on multi-/many-core processors, etc.
  • Supported by:
    National Key Research and Development Program of China(2017YFB0202105);National Key Research and Development Program of China(2016YFB0200803);National Key Research and Development Program of China(2017YFB0202302);National Natural Science Foundation of China(61972376);Natural Science Foundation of Beijing(L182053)

摘要:

快速傅里叶变换(FFT)算法是处理器基础软件生态的重要组成部分,在工程、科学、物理和数学等领域的应用十分广泛,且这些领域对FFT算法的性能也提出了越来越高的要求。研究FFT算法在ARMv8和X86-64上的高性能实现特别是大基高性能的实现,提高FFT算法的计算性能日益重要。针对ARMv8和X86-64计算平台的架构特征,研究FFT算法的高性能实现和优化方法。通过蝶形网络优化、大基网络级数降低、大基蝶形计算优化、SIMD汇编优化以及寄存器使用策略优化等方法的应用,有效提升了FFT算法的性能,特别是提升了FFT大基的计算性能,解决了寄存器不够用的性能瓶颈,并最终总结了一套Cooley-Tukey FFT算法的高性能实现策略和优化方案。实验结果表明,在ARM、X86-64处理器上,实现的FFT算法,较ARMPL、Intel MKL和FFTW性能有明显提升,较中小基性能也有明显提升。

关键词: 快速傅里叶变换(FFT), ARMv8, X86-64, FFTW, SIMD优化

Abstract:

The fast Fourier transform (FFT) algorithm is considered as an important element of the processor’s basic software ecology, and it is widely applied in the field of engineering, science, physics and mathematics. Meanwhile, the requirements for the performance of FFT in these applications are also continuously rising. Therefore, it is of definite significance to study the high-performance implementation of FFT algorithm, especially the high-performance implementation of large radices of FFT in ARMv8 and X86-64, and to improve the calculation performance of FFT algorithm. In view of the architectural features of the ARMv8 and X86-64 computing platforms, this paper studies the high-performance implementation and optimization methods of the FFT algorithm. Through the application of butterfly network optimization, large radices network stages decrease, large radices butterfly computation optimization, SIMD (single instruction multiple data) assembly optimization, and register usage optimization methods, this paper effectively improves the performance of the FFT algorithm, considerably improves the calculation performance of the large radices of FFT, and solves the performance bottlenecks of insufficiency of register resources. Lastly, the summary of a set of Cooley-Tukey FFT algorithm high-performance implementation strategies and optimization solutions is made. The experimental results indicate that for the ARM and X86-64 processors, the FFT algorithm implemented can achieve a significant improvement in performance compared with ARMPL (ARM performance library), Intel MKL (math kernel library) and FFTW (fastest Fourier transform in the West) and can achieve a significant improvement in performance compared with small and medium radices.

Key words: fast Fourier transform (FFT), ARMv8, X86-64, FFTW, SIMD optimization

中图分类号: