High-level synthesis (HLS) tools simplify the FPGA design process by allowing users to express their designs in high-level languages such as C/C++ or OpenCL. This lets users focus on algorithmic optimization with less concern for the cycle-by-cycle details at the register-transfer level (RTL). However, FPGA development flows still have two major limitations that hinder the adoption of FPGAs:
- Limited achievable frequency. There is still a considerable gap between the quality-of-result (QoR) of an HLS-generated design and what an RTL expert can achieve, especially in the maximum operating frequency of the design. As designs scale up in size, the final achievable frequency drops even further. Unfortunately, a frequency degradation leads directly to a proportional performance drop.
- Prolonged compilation time. In the current FPGA CAD flow, the RTL generated by the HLS compiler is passed to traditional synthesis and implementation tools. Although C-to-RTL compilation is relatively quick, the RTL-to-bitstream implementation process takes much longer. As designs become increasingly complex and FPGA devices larger, compile time surges from hours to days. Such a long process seriously limits engineering productivity, especially when compared to software compilation, which takes only seconds or minutes.
We observe that existing FPGA CAD flows have not taken full advantage of HLS for further timing optimization and compile-time reduction. Currently, the synthesis, placement, and routing tools are implemented and optimized to handle arbitrary RTL inputs. Those tools adhere to the cycle-accurate behavior of the input design to ensure the correctness of the output. HLS-generated RTL, however, is highly flexible and may tolerate additional pipeline registers without causing functional errors. Such latency-insensitive properties could significantly help the downstream compilation with timing closure. Yet in current toolchains, HLS compilation is a standalone step, and the logic synthesis tool treats HLS-generated RTL in the same way as manually written RTL. As a result, the information on pipeline flexibility in HLS designs is lost, and the downstream physical implementation process cannot insert pipeline registers for timing closure.
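The latency-insensitive property can be illustrated with a small simulation. The sketch below is a hypothetical Python model, not taken from any HLS tool: the names `simulate_channel` and `stages` are illustrative, and backpressure is deliberately left out. A producer sends tokens qualified by a valid flag through a chain of pipeline registers, and a consumer acts only on valid tokens.

```python
def simulate_channel(tokens, stages, cycles=None):
    """Cycle-by-cycle model of a valid-qualified stream through `stages` registers.

    Returns (received_tokens, cycle_of_first_arrival).
    Assumption: the consumer is always ready, so no backpressure is modeled.
    """
    if cycles is None:
        cycles = len(tokens) + stages + 1
    # Each register holds a (valid, data) pair; all start out invalid.
    regs = [(False, None)] * stages
    received = []
    first_arrival = None
    for cycle in range(cycles):
        # The producer drives one token per cycle until it runs out.
        src = (True, tokens[cycle]) if cycle < len(tokens) else (False, None)
        # The consumer sees the last register, or the producer if there are none.
        out = regs[-1] if stages > 0 else src
        if out[0]:
            if first_arrival is None:
                first_arrival = cycle
            received.append(out[1])
        # Clock edge: every register captures the value of its predecessor.
        if stages > 0:
            regs = [src] + regs[:-1]
    return received, first_arrival
```

Running the model with zero and with three register stages yields the same token sequence at the consumer; only the cycle of first arrival changes. A tool that knows a connection behaves this way is free to add registers on it, whereas a tool that must preserve cycle-accurate behavior is not.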
Based on this observation, we propose methods to co-optimize the HLS compilation and the physical design process, which enable frequency improvement and speed up the hardware accelerator development process simultaneously. Unlike conventional compilation stacks that separate HLS compilation from the downstream physical implementation process, we propose to bridge the gap between HLS and physical design. By facilitating placement and routing with the latency-insensitive information of HLS, and in turn by guiding the HLS compilation with physical layout information, we can achieve a significant improvement in QoR and a reduction in compile time.
Centered around this core idea, my thesis consists of three major parts. First, we explore how to improve the inherent timing quality of the RTL generated by HLS. Next, we couple HLS scheduling with coarse-grained floorplanning to improve the achievable frequency. Finally, we take one step further by partitioning the design for parallel placement and routing, then efficiently stitching the partitions together without losing timing quality.
First, the thesis addresses the timing-closure challenge by improving the inherent timing quality of the machine-generated RTL. This chapter studies the timing issues in a diverse set of realistic and complex FPGA HLS designs, including two of my previously published accelerator designs for genome sequencing. We observe that in almost all cases, the frequency degradation is caused by the broadcast structures generated by the HLS compiler. We classify three major types of broadcasts and propose a set of effective yet easy-to-implement approaches to address them. Our experimental results show that our methods improve the maximum frequency of a set of nine representative HLS benchmarks by 53% on average.
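To see why broadcasts hurt timing, consider one source that must reach many sinks within a single cycle: the net's fanout and wire length grow with the number of sinks. A generic remedy, shown here purely for illustration and not necessarily one of the approaches proposed in the thesis, is to distribute the signal through a tree of registers with bounded fanout. The sketch below, with the hypothetical helper `broadcast_tree_levels`, computes how many register levels such a tree needs.

```python
def broadcast_tree_levels(num_sinks, max_fanout):
    """Register levels needed so that no driver has more than `max_fanout` loads.

    Zero levels means the source can drive every sink directly.
    With L levels, the tree reaches up to max_fanout ** (L + 1) sinks.
    """
    if num_sinks < 1 or max_fanout < 2:
        raise ValueError("need num_sinks >= 1 and max_fanout >= 2")
    levels = 0
    reach = 1  # number of drivers available at the current level
    # Add a register level while the current drivers cannot cover all sinks.
    while reach * max_fanout < num_sinks:
        reach *= max_fanout
        levels += 1
    return levels
```

Under these assumptions, 64 sinks with a fanout limit of 4 need two register levels, and a 65th sink forces a third. Each level adds one cycle of latency, which is acceptable only when the surrounding design tolerates it.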
In addition to optimizing the QoR of HLS by itself, the thesis further raises the final frequency by coupling HLS compilation with floorplanning. We propose AutoBridge, an automated framework that couples a coarse-grained floorplanning step with pipelining during HLS compilation. Since pipelining may introduce additional latency, we further present analysis and algorithms to ensure that the added latency does not compromise the overall throughput. In our experiments with a total of 43 design configurations, we improve the average frequency from 147 MHz to 297 MHz (a 102% improvement) with no loss of throughput and a negligible change in resource utilization. Notably, in 16 experiments, we make originally unroutable designs achieve 274 MHz on average. AutoBridge was recognized with the Best Paper Award at FPGA 2021.
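Added pipeline latency can threaten throughput when two paths reconverge at the same module: if one path gains more registers than the other, the faster path may stall while it waits. One simple way to keep paths balanced is sketched below. It is an illustrative longest-path scheme under stated assumptions (an acyclic module graph with at most one edge per module pair), not AutoBridge's actual algorithm, and the name `balance_latency` is hypothetical.

```python
from graphlib import TopologicalSorter  # Python 3.9+


def balance_latency(edges):
    """Compute padding that equalizes added latency on reconvergent paths.

    edges: dict mapping (src, dst) -> pipeline stages already added on that edge.
    Returns a dict mapping (src, dst) -> extra stages to add, such that every
    path between the same pair of modules carries the same total added latency.
    """
    # Build the predecessor map that TopologicalSorter expects.
    preds = {}
    for (u, v) in edges:
        preds.setdefault(u, set())
        preds.setdefault(v, set()).add(u)
    # depth[n] = largest added latency on any path from a source module to n.
    depth = {}
    for node in TopologicalSorter(preds).static_order():
        depth[node] = max(
            (depth[p] + edges[(p, node)] for p in preds[node]), default=0
        )
    # Pad each edge up to the depth difference of its endpoints.
    return {(u, v): depth[v] - depth[u] - lat for (u, v), lat in edges.items()}
```

For example, if module `A` feeds `D` through both `B` and `C`, and only the `A`-to-`B` connection received two pipeline stages, the function pads the `C`-to-`D` edge with two stages and leaves the other edges unchanged. This scheme guarantees balance but does not minimize the number of added registers; a cost-aware formulation would be needed for that.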
Finally, we take one step further to enable parallel physical implementation on top of our HLS-floorplan co-design methodology. We propose a split compilation approach based on the pipelining flexibility at the HLS level. This flexibility allows us to partition designs for parallel placement and routing without timing degradation. Our research produces RapidStream, a parallelized and physically integrated compilation framework that takes in a latency-insensitive program in C/C++ and generates a fully placed and routed implementation. When tested on the AMD/Xilinx U280 FPGA, RapidStream achieves a 5-7X compile time reduction and a 1.3X frequency increase. RapidStream was recognized with the Best Paper Award at FPGA 2022.
In conclusion, my thesis targets two of the most challenging problems for modern EDA tools: timing closure and agile compilation. We first study fanout optimization at the HLS level. Next, we explore the co-optimization of HLS and floorplanning, which has been used by at least eight other accelerator design projects. Finally, we enable the split compilation of HLS designs to reduce the compile time significantly. At the end of the thesis, we discuss future directions, including extending the methodology to support the compilation of RTL designs, multi-FPGA designs, and ASIC designs.