The last two decades have witnessed two opposing hardware trends: DRAM capacity and access bandwidth have increased rapidly, by 128x and 20x respectively, while DRAM latency has remained nearly constant, improving by only 1.3x over the same period. Long memory latency therefore continues to be a critical performance bottleneck in modern systems. Another trend is the stagnation of processor clock speeds caused by the end of Dennard scaling. Parallel architectures such as multicore CPUs and GPUs address this stagnation by increasing parallelism, but the resulting designs rely extensively on data locality, in the form of large cache hierarchies on multicores and vectorized execution on SIMD-enabled CPUs and GPUs.
At the same time, many data-intensive applications are moving away from data locality toward irregular memory access behavior. This behavior arises either from dataflow irregularity (caused by indirection in data accesses) or from control-flow irregularity (caused by data-dependent branches). Such applications are often memory bound, and their performance is primarily determined by memory latency, a phenomenon known as the memory wall. An alternative approach is to mask long memory latencies through multithreaded execution, in which a running thread relinquishes execution to a ready thread as soon as it issues a long-latency memory access. This dissertation explores how custom hardware accelerators that employ latency-masking multithreading can improve the performance of irregular applications. In hardware multithreaded designs, thread state is managed directly in hardware, and with enough application parallelism such designs can fully mask memory latency without relying on caches.
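The two forms of irregularity described above can be illustrated with a minimal sketch (in Python, not taken from the thesis; the function names are illustrative only). The first function shows dataflow irregularity, where the address of each load depends on data read at runtime; the second shows control-flow irregularity, where a data-dependent branch determines the execution path.

```python
def gather(values, index):
    """Dataflow irregularity: each access values[index[k]] is an
    indirect load whose address is only known at runtime, so the
    access pattern is unpredictable and cache-unfriendly."""
    return [values[i] for i in index]

def count_matches(rows, threshold):
    """Control-flow irregularity: whether the branch is taken depends
    on the data itself, which defeats branch prediction and
    straight-line SIMD execution."""
    count = 0
    for r in rows:
        if r > threshold:  # data-dependent branch
            count += 1
    return count
```

In both cases the processor stalls on memory or mispredicted branches rather than on computation, which is why latency masking, rather than caching, is the natural remedy.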
This thesis applies hardware multithreading to three applications that exhibit irregular behavior. The first two, selection and group-by aggregation, are relational database operators. The selection operator evaluates conditions on various row attributes; depending on the evaluation, a row is either qualified or disqualified, so the operator exhibits control-flow irregularity. The proposed selection design achieves a 1.8x improvement over a CPU while consuming 2.5x less power, and it is 3.2x more bandwidth efficient than a GPU. This dissertation provides the first direct comparison of the selection operator across all three architectures. Group-by aggregation, in contrast, exhibits dataflow irregularity; results show that the FPGA-accelerated approach significantly outperforms CPU-based implementations, with speedups of up to 10x. The third application is a popular sparse linear algebra operation, sparse matrix-vector multiplication (SpMV).
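SpMV is a canonical example of dataflow irregularity: in the standard compressed sparse row (CSR) format, each access to the dense vector is an indirect load through the matrix's column indices. The sketch below (a generic reference implementation, not the accelerator design from the thesis) makes the indirection explicit.

```python
def spmv_csr(row_ptr, col_idx, vals, x):
    """Compute y = A @ x for a matrix A stored in CSR format.

    row_ptr[r]..row_ptr[r+1] delimits the nonzeros of row r;
    col_idx[k] and vals[k] give the column and value of nonzero k.
    """
    y = [0.0] * (len(row_ptr) - 1)
    for r in range(len(row_ptr) - 1):
        for k in range(row_ptr[r], row_ptr[r + 1]):
            # x[col_idx[k]] is an indirect, data-dependent load: the
            # access pattern into x follows the sparsity structure of A,
            # so it exhibits little spatial or temporal locality.
            y[r] += vals[k] * x[col_idx[k]]
    return y
```

Because the loads into `x` scatter across memory according to the sparsity pattern, SpMV performance on conventional architectures is bounded by memory latency, making it a natural target for the latency-masking multithreaded accelerators studied in this dissertation.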