eScholarship
Open Access Publications from the University of California

UC Berkeley Electronic Theses and Dissertations

Accelerator Synthesis and Integration for CPU+FPGA Systems

Abstract

As the scaling down of transistor size no longer provides a boost to processor clock frequency, there has been a move towards parallel computers and, more recently, heterogeneous computing platforms. To target the FPGA component in these systems, high-level synthesis (HLS) tools were developed to facilitate hardware generation from higher-level algorithmic descriptions. Although HLS is an effective method for rapid hardware generation, when compute-intensive software kernels are offloaded to FPGA accelerators, current HLS tools do not always take full advantage of the hardware platforms. Processor-centric software implementations often have to be rewritten if good quality of results is desired.

In this work, we present a framework to refactor and restructure compute-intensive software kernels, making them better suited for FPGA platforms. We propose an algorithm that decouples memory operations from computation, generating accelerator pipelines composed of independent modules connected through FIFO channels. These decoupled computational pipelines achieve much better throughput due to their efficient use of memory bandwidth and improved tolerance of data access latency. Our methodology complements existing work in high-level synthesis and facilitates the creation of heterogeneous systems with high-performance accelerators and general-purpose processors. With our approach, for a set of non-regular algorithm kernels written in C, a performance improvement of 3.3x to 9.1x is observed over direct C-to-hardware mapping using a state-of-the-art HLS tool.
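To make the decoupling idea concrete, the following is a minimal software analogue of an access/execute pipeline: an "access" module streams irregularly indexed loads into a bounded FIFO channel, and a "compute" module consumes them. All names here (fifo_t, decoupled_sum, FIFO_CAP) are illustrative assumptions, not identifiers from the dissertation, and the single-threaded alternation merely mimics the concurrent hardware modules.

```c
#include <assert.h>
#include <stddef.h>

#define FIFO_CAP 4

/* Bounded FIFO channel connecting two pipeline modules. */
typedef struct {
    int buf[FIFO_CAP];
    size_t head, tail, count;
} fifo_t;

static int fifo_push(fifo_t *f, int v) {   /* returns 0 when full */
    if (f->count == FIFO_CAP) return 0;
    f->buf[f->tail] = v;
    f->tail = (f->tail + 1) % FIFO_CAP;
    f->count++;
    return 1;
}

static int fifo_pop(fifo_t *f, int *v) {   /* returns 0 when empty */
    if (f->count == 0) return 0;
    *v = f->buf[f->head];
    f->head = (f->head + 1) % FIFO_CAP;
    f->count--;
    return 1;
}

/* Sum data[idx[i]] for i in [0, n), with memory access decoupled
 * from computation through the FIFO channel. */
int decoupled_sum(const int *data, const int *idx, size_t n) {
    fifo_t ch = {0};
    size_t fetched = 0, consumed = 0;
    int acc = 0, v;
    while (consumed < n) {
        /* access module: issue irregular loads, fill the channel */
        while (fetched < n && fifo_push(&ch, data[idx[fetched]]))
            fetched++;
        /* compute module: drain the channel and accumulate */
        while (fifo_pop(&ch, &v)) {
            acc += v;
            consumed++;
        }
    }
    return acc;
}
```

In hardware, the two loops would run concurrently, so the compute module keeps working while outstanding loads are in flight; the FIFO depth is what buys tolerance to access latency.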

To ensure the absence of artificial deadlocks in the pipelines generated by our framework, we also formulated an analysis scheme that examines the various dependencies between operations distributed across different pipeline modules. The interactions between the modules' schedules, the capacities of the communication channels, and the memory access mechanisms are all incorporated into our model, so that potential artificial deadlocks can be detected and resolved a priori. Our technique applies not only to the computational pipelines generated by our algorithm, but also to other networks of communicating processes, provided their interaction with the channels follows a set of simple rules.
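A small sketch of the phenomenon the analysis must catch: with bounded channels, a perfectly valid pair of module schedules can stall purely because a channel is too shallow. The traces, channel layout, and names below (producer, consumer, deadlocks) are hypothetical examples, not the dissertation's model; the simulation just steps both modules round-robin and reports when neither can advance.

```c
#include <assert.h>

enum { WRITE_C1, WRITE_C2, READ_C1, READ_C2, DONE };

/* The producer writes two tokens to channel c1, then one to c2; the
 * consumer reads c2 first, then drains c1. If c1 holds only one token,
 * the producer stalls on its second write before ever feeding c2,
 * while the consumer waits on c2: an artificial deadlock. */
static const int producer[] = { WRITE_C1, WRITE_C1, WRITE_C2, DONE };
static const int consumer[] = { READ_C2, READ_C1, READ_C1, DONE };

/* Returns 1 if the pair deadlocks given c1's capacity (c2 capacity 1). */
int deadlocks(int c1_cap) {
    int c1 = 0, c2 = 0;               /* tokens currently buffered */
    int pc[2] = { 0, 0 };             /* per-module program counters */
    const int *prog[2] = { producer, consumer };
    for (;;) {
        int progressed = 0;
        for (int p = 0; p < 2; p++) {
            switch (prog[p][pc[p]]) {
            case WRITE_C1: if (c1 < c1_cap) { c1++; pc[p]++; progressed = 1; } break;
            case WRITE_C2: if (c2 < 1)      { c2++; pc[p]++; progressed = 1; } break;
            case READ_C1:  if (c1 > 0)      { c1--; pc[p]++; progressed = 1; } break;
            case READ_C2:  if (c2 > 0)      { c2--; pc[p]++; progressed = 1; } break;
            case DONE: break;
            }
        }
        if (prog[0][pc[0]] == DONE && prog[1][pc[1]] == DONE) return 0;
        if (!progressed) return 1;    /* no module can advance: deadlock */
    }
}
```

An a priori analysis of the kind described above would flag this configuration and size c1 to hold at least two tokens, without ever running the system.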

To push the limits of FPGA platform usability, we also explored the generation and integration of accelerators using only program binaries and execution profiles. Since no user input is assumed, this approach is applied only to more regular applications, where the memory access patterns are analyzable and coarse-grained parallelism can be extracted. A run-time mechanism is also devised to ensure the correctness of the parallelization performed during accelerator synthesis. With the help of binary instrumentation tools, the FPGA-accelerated parts can be integrated into the original application in a user-transparent way: neither recompilation of the original program nor access to its source code is required. This approach was applied to a few benchmarks for which decoupled computational pipelines were synthesized. With memory-level and coarse-grained parallelization, a significant performance improvement (3.7x to 9x) over a general-purpose processor was observed, despite the FPGA running at a fraction of the CPU's clock frequency. The run-time checking mechanism was also shown to incur only small overhead, especially for loop nests with a large number of iterations.
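One common shape such a run-time correctness check can take is an address-range disjointness test: before dispatching the parallelized chunks of a loop nest, verify that no chunk's writes overlap another chunk's reads or writes, and fall back to the original sequential binary otherwise. The chunking scheme and the names below (range_t, parallel_safe) are illustrative assumptions, not the dissertation's mechanism.

```c
#include <assert.h>
#include <stddef.h>

typedef struct { size_t lo, hi; } range_t;   /* half-open byte range [lo, hi) */

static int overlap(range_t a, range_t b) {
    return a.lo < b.hi && b.lo < a.hi;
}

/* Returns 1 if the chunks may run in parallel: no chunk's write range
 * intersects another chunk's read or write range. Cost is quadratic in
 * the number of chunks, which is small and amortized over the many
 * iterations inside each chunk. */
int parallel_safe(const range_t *reads, const range_t *writes, size_t nchunks) {
    for (size_t i = 0; i < nchunks; i++)
        for (size_t j = 0; j < nchunks; j++) {
            if (i == j) continue;
            if (overlap(writes[i], reads[j]) || overlap(writes[i], writes[j]))
                return 0;    /* conflict: run the original sequential code */
        }
    return 1;                /* safe to offload the chunks in parallel */
}
```

Because the test runs once per loop-nest invocation, its overhead shrinks relative to the accelerated work as the iteration count grows, consistent with the small overhead reported above.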
