FPGA HPC using OpenCL: Case Study in 3D FFT
FPGAs have typically achieved high speedups for 3D Fast Fourier Transforms (FFTs) due to the presence of hard floating-point units, low-latency specialized pipelines, and support for complex connectivity among processing elements. Previous implementations have relied on FFT IP cores to perform the computation, because manually developing, maintaining, and upgrading efficient pipelines in HDL is complex. These IP cores, however, are bulky and cannot be fully tuned for specific FFT sizes because they use generic architectures. HLS tools, such as OpenCL, offer a more customizable alternative but have performed worse than HDL in previous work. In this paper, we show that, using a set of code structure optimizations, OpenCL designs can be compiled to Radix-2 FFT pipelines which outperform IP-core-based designs for the same throughput. We further show that the HDL generated by the OpenCL compiler can be isolated and seamlessly integrated into existing 3D FFT shells to reduce implementation effort. Our single-device design, tested on the Altera Arria10X115 FPGA, achieves an average speedup of 29x vs CPU-MKL, 4.1x vs GPU cuFFT, and 1.1x vs IP Core FFT implementations for 16³, 32³ and 64³ FFTs. Moreover, OpenCL-generated compute pipelines for 8³, 16³, 32³ and 64³ FFTs use an average of 7.5x fewer ALMs and 1.6x fewer DSPs than the corresponding IP core versions.
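For readers unfamiliar with the terms, the following is a minimal software sketch of a radix-2 FFT and of how an N³ transform decomposes into 1D FFTs along each axis. It is written in plain Python for illustration only; it is not the OpenCL kernel or the hardware pipeline described in this paper, and the names `fft_radix2` and `fft3d` are hypothetical.

```python
import cmath


def fft_radix2(x):
    """Recursive radix-2 decimation-in-time FFT of a power-of-two-length list."""
    n = len(x)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    if n == 1:
        return list(x)
    even = fft_radix2(x[0::2])  # FFT of even-indexed samples
    odd = fft_radix2(x[1::2])   # FFT of odd-indexed samples
    out = [0j] * n
    for k in range(n // 2):
        # Butterfly: combine the two half-size results with a twiddle factor.
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out


def fft3d(a):
    """3D FFT of a cube a[x][y][z] with side N, as 1D FFTs along z, then y, then x."""
    n = len(a)
    # Pass 1: transform every row along z (builds a new cube, input is not modified).
    a = [[fft_radix2(a[x][y]) for y in range(n)] for x in range(n)]
    # Pass 2: transform every column along y.
    for x in range(n):
        for z in range(n):
            col = fft_radix2([a[x][y][z] for y in range(n)])
            for y in range(n):
                a[x][y][z] = col[y]
    # Pass 3: transform every column along x.
    for y in range(n):
        for z in range(n):
            col = fft_radix2([a[x][y][z] for x in range(n)])
            for x in range(n):
                a[x][y][z] = col[x]
    return a
```

Each of the three passes performs N² independent N-point transforms, which is the source of the parallelism that a pipelined hardware design can exploit.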