An Empirically Guided Optimization Framework for FPGA OpenCL
FPGAs have been demonstrated to be capable of very high performance, especially power-performance, but generally at the cost of requiring HDL code hand-tuned by FPGA experts. OpenCL is the leading industry effort to improve performance-programmability. While it is recognized that optimizing OpenCL code using published best practices is critical to achieving good performance, even optimized OpenCL code has so far rarely matched the performance of HDL code or of competing technologies such as GPUs. In this paper we propose a series of systematic, empirically guided code optimizations that augment current best practices and substantially improve achieved performance. Our work characterizes and measures the impact of each of these optimizations. This not only gives programmers a script to follow when optimizing their own kernels, but also opens the way for autotuners that apply the optimizations automatically. We also demonstrate that, when these proposed code design practices are applied to a number of parallel computing dwarfs, our optimized kernels outperform CPU and previous FPGA OpenCL implementations by 1.2× and 5×, respectively. Moreover, our optimizations enable OpenCL FPGA codes to consistently achieve performance within striking distance (approximately 2×) of the best current equivalent code for GPUs and HDL. To the best of our knowledge, this is at least 2× better than previous characterizations of OpenCL FPGA optimizations.