Hardware accelerators such as GPUs and FPGAs can often provide enormous computing capabilities and power efficiency, as long as the working set fits in the on-board memory capacity of the accelerator. But if the working set does not fit, data must be streamed from the larger host memory or storage, causing performance to be limited by the slow communication bandwidth between the accelerator and the host. While compression is an effective method to reduce data storage and movement overhead, it has not been very useful in solving this issue due to efficiency and performance limitations. This is especially true for scientific computing accelerators with heavy floating-point arithmetic, because efficiently compressing floating-point numbers requires complex, floating-point specific algorithms.
This dissertation addresses the host-side bandwidth issue of accelerators, specifically FPGA accelerators, using a series of hardware-optimized compression algorithms. Since typical compression algorithms are not designed with efficient hardware implementation in mind, we explore and implement variants of existing algorithms for high performance and efficiency. We demonstrate the impact of our ideas using two classes of applications: Grid-based scientific computing, and high-dimensional nearest neighbor search. We have implemented a scientific computing accelerator platform (BurstZ+), which uses a class of novel error-controlled lossy floating-point compression algorithms (ZFP-V Series). We demonstrate that BurstZ+ can completely remove the host-accelerator communication bottleneck for accelerators. Evaluated against hand-optimized kernel accelerator implementations without compression, our single-pipeline BurstZ+ prototype outperforms an accelerator without compression by almost 4×, and even an accelerator with enough memory for the entire dataset by over 2×. We have also developed a near-storage high-dimensional nearest neighbor search accelerator (ZipNN) which uses a hardware-optimized group varint compression algorithm to remove the host-side communication bottleneck. Our ZipNN prototype outperforms an accelerator without compression by 6×, and even much costlier in-memory multithreaded software implementations by over 2×.