High-Performance Implementation and Optimization of Cooley-Tukey FFT Algorithm

doi:10.3778/j.issn.1673-9418.2011092

Journal of Frontiers of Computer Science and Technology, 2022, Vol. 16, Issue (6): 1304-1315. DOI: 10.3778/j.issn.1673-9418.2011092

GUO Jinxin¹,², ZHANG Guangting²,⁺, ZHANG Yunquan², CHEN Zehua¹, JIA Haipeng²

1. College of Data Science, Taiyuan University of Technology, Taiyuan 030024, China
2. State Key Laboratory of Computer Architecture, Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190, China

Received: 2020-11-27 Revised: 2021-02-01 Published: 2022-06-01 Online: 2021-03-03
Corresponding author: ⁺ E-mail: theking140@163.com
About the authors:
GUO Jinxin, born in 1994 in Yangquan, Shanxi, M.S. candidate, student member of CCF. His research interests include high-performance computing, parallel programming, etc.
ZHANG Guangting, born in 1987 in Tai'an, Shandong, M.S., engineer, member of CCF. Her research interests include parallel algorithms, big data, etc.
ZHANG Yunquan, born in 1973 in Liaocheng, Shandong, Ph.D., professor, Ph.D. supervisor, senior member of CCF. His research interests include high-performance parallel computing, with particular emphasis on large-scale parallel computation and programming models, high-performance parallel numerical algorithms, performance modeling and evaluation for parallel programs, etc.
CHEN Zehua, born in 1974 in Shenchi, Shanxi, professor, senior member of CCF. Her research interests include granular computing and knowledge discovery, industrial big data, etc.
JIA Haipeng, born in 1983 in Weifang, Shandong, Ph.D., senior engineer. His research interests include heterogeneous computing, many-core parallel programming methods, computer vision algorithms on multi-/many-core processors, etc.
Supported by:
National Key Research and Development Program of China (2017YFB0202105); National Key Research and Development Program of China (2016YFB0200803); National Key Research and Development Program of China (2017YFB0202302); National Natural Science Foundation of China (61972376); Natural Science Foundation of Beijing (L182053)

Abstract

Abstract:

The fast Fourier transform (FFT) is an important part of the basic software ecosystem of a processor. It is widely used in engineering, science, physics and mathematics, and these applications place ever higher demands on FFT performance. It is therefore increasingly important to study high-performance FFT implementations on ARMv8 and X86-64, especially for large radices, and to improve the computational performance of the algorithm. Based on the architectural features of the ARMv8 and X86-64 computing platforms, this paper studies high-performance implementation and optimization methods for the FFT. By applying butterfly network optimization, reduction of the number of network stages through large radices, large-radix butterfly computation optimization, SIMD (single instruction multiple data) assembly optimization, and register usage optimization, the paper effectively improves FFT performance, in particular for large radices, and resolves the performance bottleneck caused by insufficient registers. Finally, it summarizes a set of high-performance implementation strategies and optimization schemes for the Cooley-Tukey FFT algorithm. Experimental results show that, on ARM and X86-64 processors, the implemented FFT achieves a significant performance improvement over ARMPL (ARM Performance Libraries), Intel MKL (Math Kernel Library) and FFTW (Fastest Fourier Transform in the West), and that large radices also clearly outperform small and medium radices.

Key words: fast Fourier transform (FFT), ARMv8, X86-64, FFTW, SIMD optimization
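
The role of the radix in the abstract can be illustrated with a small sketch. The code below is not the paper's implementation (which relies on SIMD assembly and a Stockham-style network); it is a plain decimation-in-time Cooley-Tukey recursion, written only to show how a radix-r butterfly combines r sub-transforms and why a larger radix needs fewer stages. The function names dft, fft and stage_count are hypothetical and chosen for this illustration.

```python
import cmath


def dft(x):
    """Naive O(n^2) DFT, used as the base case and as a reference."""
    n = len(x)
    return [
        sum(x[j] * cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n))
        for k in range(n)
    ]


def fft(x, radix=2):
    """Decimation-in-time Cooley-Tukey FFT with a fixed radix (sketch).

    Falls back to the naive DFT when the length is not larger than the
    radix or is not divisible by it.
    """
    n = len(x)
    if n <= radix or n % radix != 0:
        return dft(x)
    m = n // radix
    # Split into `radix` interleaved sub-sequences and transform each.
    subs = [fft(x[q::radix], radix) for q in range(radix)]
    out = [0j] * n
    for k in range(m):
        # Multiply each sub-transform output by its twiddle factor W_n^(q*k).
        t = [subs[q][k] * cmath.exp(-2j * cmath.pi * q * k / n)
             for q in range(radix)]
        # Radix-point butterfly: a small DFT of size `radix` over t.
        for p in range(radix):
            out[k + p * m] = sum(
                t[q] * cmath.exp(-2j * cmath.pi * q * p / radix)
                for q in range(radix)
            )
    return out


def stage_count(n, radix):
    """Number of butterfly stages when n is an exact power of radix."""
    stages = 0
    while n > 1:
        if n % radix != 0:
            raise ValueError("n must be a power of radix")
        n //= radix
        stages += 1
    return stages
```

For a 4096-point transform, stage_count(4096, 2) gives 12 while stage_count(4096, 16) gives 3, so each data element passes through memory far fewer times with a large radix. The price is a larger butterfly: a radix-16 butterfly keeps 16 complex values live at once, which is where register pressure of the kind the abstract mentions would be expected to arise.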

CLC number: TP311

GUO Jinxin, ZHANG Guangting, ZHANG Yunquan, CHEN Zehua, JIA Haipeng. High-Performance Implementation and Optimization of Cooley-Tukey FFT Algorithm[J]. Journal of Frontiers of Computer Science and Technology, 2022, 16(6): 1304-1315.

Figures/Tables 13

Fig.1 ARMv8 floating-point registers

Fig.2 Decimation-in-time (DIT) network

Fig.3 Stockham butterfly network

Fig.4 Distribution of radix-14 twiddle factors in the complex plane

Fig.5 SIMD optimization

Fig.6 Comparison of computation instructions

Fig.7 Sequential execution versus instruction reordering

Table 1 Experimental environment

Hardware	Environment 1	Environment 2
CPU	Huawei Kunpeng 920	Xeon E5-2640 V4
Architecture	AArch64	X86-64
Frequency/GHz	2.1	2.4
SIMD/bit	128	256
L1 cache/KB	32	32
Compiler	9.2.0	5.4.0

Fig.8 Performance comparison of ARM 1D C2C FFT

Fig.9 Performance comparison of X86-64 1D C2C FFT

Table 2 Average and maximum speedups of OpenFFT (%)

Type	Speedup	ARMv8 vs. FFTW	ARMv8 vs. ARMPL	X86-64 vs. FFTW	X86-64 vs. MKL
Float	Average	107.00	41.00	73.50	34.00
Float	Max	160.00	77.00	143.00	72.00
Double	Average	26.00	2.44	39.50	23.70
Double	Max	40.00	26.96	98.00	47.00

Fig.10 Performance comparison of ARMv8 1D C2C FFT

Fig.11 Performance comparison of X86-64 1D C2C FFT

