Long memory latencies are mitigated through the use of large cache hierarchies in multi-core architectures, SIMD execution in GPU architectures and streaming of data in FPGA-based accelerators. However, none of these approaches benefits irregular applications that exhibit no locality and rely on extensive pointer de-referencing for data accesses. By masking the memory latency, multi-threaded execution has been demonstrated to deal effectively with such applications.
In the MT-FPGA model a multi-threaded engine is implemented on the FPGA accelerator specifically for the masking on the memory latency in the execution of irregular applications: following a memory access, the execution is switched to a ready thread while the suspended threads wait for the return of the requested data value from memory. The multi-threaded engine is automatically generated, from C code, by the CHAT compilation tool and is customized to the specific application.
In this paper we use the Sparse Vector Matrix application to evaluate the performance of the MT-FPGA execution and compare it to the latest GPU architectures over a wide range of benchmarks.