CPU-FPGA heterogeneous architectures are attracting ever-increasing attention as a means to advance the computational capabilities and energy efficiency of today's datacenters. Such architectures allow programmers to reprogram the FPGAs for flexible acceleration of many workloads. However, this advantage is often overshadowed by two critical issues: 1) the poor programmability of FPGAs and 2) the severe overhead of CPU-FPGA integration. On the one hand, the conventional RTL-based FPGA design practice significantly slows down the application development cycle. Although recent advances in high-level synthesis (HLS) have improved FPGA programmability to some extent, programmers are still left with the challenge of manually identifying the optimal design configuration in a tremendous design space; addressing this challenge demands intimate knowledge of hardware intricacies and a great deal of effort, even for hardware experts. On the other hand, even when a high-quality FPGA accelerator achieves an orders-of-magnitude performance/watt gain for a computation kernel, such an impressive gain can often be dramatically offset by the extra CPU-FPGA data communication overhead, resulting in a much reduced system-wide speedup, or even a slowdown.
This thesis aims to address these two issues so as to facilitate the adoption of FPGAs in datacenters. To improve FPGA programmability, we propose a methodology that automates the laborious code reconstruction from software programs into behavioral descriptions of high-quality FPGA designs, through well-defined architecture templates. Specifically, we propose the composable, parallel and pipeline (CPP) microarchitecture as an accelerator design template. This well-defined architecture template derives high-quality accelerator designs for a broad class of computation kernels and substantially reduces the overall design space. It also enables a CPP analytical model that quantifies the performance-resource trade-offs among different configurations of the CPP template, which in turn allows fast design space exploration to identify the optimal CPP configuration in a reasonable time. On top of the architecture template and its analytical model, we develop the AutoAccel framework to automatically transform an input computation kernel program into its optimal CPP-based design. For general application developers, AutoAccel supplies a nearly push-button experience for producing an FPGA accelerator with good performance; for FPGA design experts, it greatly reduces the effort of manual design space exploration and code reconstruction. It thus substantially improves FPGA programmability in both cases.
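To illustrate how an analytical model can make a template-bounded design space searchable in seconds, the following sketch enumerates hypothetical CPP configurations (processing-engine count and unroll factor), estimates cycles and resource usage with a toy cost model, and keeps the fastest configuration that fits a resource budget. All parameter names, budgets, and cost formulas here are illustrative assumptions, not the actual CPP model.

```python
from itertools import product

# Assumed resource budget and kernel parameters (illustrative only).
TOTAL_DSP, TOTAL_BRAM = 2048, 1024
TRIP_COUNT = 1_000_000          # loop iterations in the kernel
OPS_PER_ITER = 8                # arithmetic ops per iteration

def estimate(pe_count, unroll):
    """Toy analytical model: cycle count and resource usage
    for one (pe_count, unroll) configuration of the template."""
    cycles = TRIP_COUNT * OPS_PER_ITER / (pe_count * unroll)
    dsp = pe_count * unroll * 4   # assumed DSPs per parallel operation
    bram = pe_count * 16          # assumed buffers per processing engine
    return cycles, dsp, bram

def explore():
    """Enumerate the (small) template-pruned design space and keep the
    fastest configuration that fits within the resource budget."""
    best = None
    for pe, unroll in product([1, 2, 4, 8, 16], [1, 2, 4, 8]):
        cycles, dsp, bram = estimate(pe, unroll)
        if dsp <= TOTAL_DSP and bram <= TOTAL_BRAM:
            if best is None or cycles < best[0]:
                best = (cycles, pe, unroll)
    return best
```

Because the template fixes the accelerator's overall structure, the search space collapses to a handful of numeric knobs, so exhaustive enumeration against the model replaces hours of synthesis-in-the-loop tuning.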
To devise an efficient CPU-FPGA integration methodology, we first conduct a quantitative analysis of the microarchitectures of state-of-the-art CPU-FPGA platforms, with a key focus on the effective latency and bandwidth of CPU-FPGA data communication. The analysis reveals three important factors that affect the efficiency of CPU-FPGA integration: 1) the payload size of each data transfer, 2) the complicated, multi-stage CPU-FPGA data transfer routine, and 3) the sharing of FPGA resources among CPU threads. We then propose three techniques: batch processing, a fully pipelined data communication stack, and the FPGA-as-a-Service (FaaS) framework, targeting these three factors, respectively. Batch processing packs small inputs into a large payload; the fully pipelined stack overlaps the various data transfer stages and the compute stage; both improve the data processing throughput. The FaaS framework treats the CPU threads as clients and the FPGA as the server, and shares the server among the clients via the canonical client-server paradigm. These three techniques form our proposed methodology for efficient CPU-FPGA integration, which is demonstrated through the JVM-FPGA integration process for a genome sequencing application.
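As a concrete illustration of the batch processing technique, the sketch below packs many small records into payloads bounded by a target size, so each CPU-FPGA transfer amortizes its fixed invocation latency over one large buffer instead of paying it per record. The function name and the byte-string representation of records are assumptions for illustration, not the thesis's actual interface.

```python
def batch(records, payload_bytes):
    """Pack small records into payloads of at most `payload_bytes` each,
    so every CPU-FPGA transfer moves one large buffer rather than many
    tiny ones (each tiny transfer pays a fixed setup latency)."""
    payloads, current, size = [], [], 0
    for r in records:
        # Flush the current payload if adding this record would overflow it.
        if size + len(r) > payload_bytes and current:
            payloads.append(b"".join(current))
            current, size = [], 0
        current.append(r)
        size += len(r)
    if current:
        payloads.append(b"".join(current))
    return payloads
```

For example, ten 2-byte records batched under a 5-byte payload limit yield five 4-byte payloads, i.e. five transfers instead of ten. In the FaaS setting, the same idea applies on the server side: requests queued by many client threads can be coalesced into one large payload before being dispatched to the shared FPGA.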