Using Multithreaded Techniques to Mask Memory Latency on FPGA Accelerators
- Author(s): Halstead, Robert Joseph
- Advisor(s): Najjar, Walid A
- et al.
The performance gap between CPUs, and memory memory has diverged significantly since the 1980's making efficiency memory utilization a key concern for any application developer. Modern CPUs will process orders of magnitude more data than their memory architectures can sustain. Multiple levels of caches are used by the major CPU architects to cope with this issue. Frequently used data is stored as close as possible to the core, which allows it to be retrieved in a few cycles. Compared to the thousands of cycles it would take to be retrieved from main memory. However, data locality is important for caches to be effective, and as applications become more and more irregular the CPU's performance drops. This causes many important applications (e.g. sparse matrices, graphs, hash tables) to suffer from poor performance. This thesis explores how custom hardware accelerators using memory masking multithreaded techniques can be used to improve performance. In hardware multithreaded designs thread states are managed directly in hardware, and with enough application parallelism they can fully mask memory latency without storing data in caches. Designs scale well to match the memory architectures bandwidth. The emergence of heterogeneous FPGA platforms has made it easier to build and test the design on real world hardware.
This thesis starts with the issue of programmability. Hardware development is notoriously difficult and time consuming. The CHAT tool, a C-to-VHDL compiler, is presented that assists developers by generating custom multithreaded kernels from high level software descriptions. The thesis proceeds by using CHAT to generate a custom Sparse Matrix Vector (SpMV) kernel. Results show how multithreading can provide data independent performance. Compared to the software and GPU performance which drops significantly as the benchmark's irregularity increases. Cache miss rates increase on the CPU, and memory cannot be coalesced as efficiently on the GPUs. Finally we use multithreading to accelerate two common database operations: Hash join, and aggregation. They are the first in-memory implementations on FPGA hardware. The hash join design shows an improvement of 2x over the best multicore software designs available, and does so with 33% less memory bandwidth. Aggregation shows comparable performance when generating hash tables. However, it generates multiple tables that need to be merged, and this step reduces performance on high cardinality datasets.