Hashing, Caching, and Synchronization: Memory Techniques for Latency Masking Multithreaded Applications
- Author(s): Windh, Skyler Arron
- Advisor(s): Najjar, Walid A
- et al.
The increase in size and decrease in cost of DRAMs have led to a rapid growth of in-memory solutions for data analytics. In this area, performance is often limited by the latency and bandwidth of the memory system. Furthermore, the move to multicore execution has put added pressure on memory bandwidth and often results in additional latency.
Irregular applications, by their very nature, suffer from poor data locality. This often results in high cache miss rates and many long waits on off-chip memory. Historically, long latencies have been dealt with in two ways: (1) latency mitigation, using large cache hierarchies, or (2) latency masking, where threads relinquish control after issuing a memory request. Multithreaded CPUs are designed for a fixed maximum number of threads, tailored to an average application. FPGAs, however, can be customized to specific applications. Their massive parallelism is well known and is ideally suited to dynamically managing hundreds, or thousands, of threads. Multithreading, in essence, trades memory bandwidth for latency.
This thesis describes the use of CAMs (Content Addressable Memories) as synchronizing caches for hardware multithreading.
We demonstrate and evaluate this mechanism by implementing multithreaded datapaths for Breadth First Search, Hash-Join, and Group-By Aggregation. Synchronization between concurrent threads is typically implemented using expensive in-memory locks that are accessed via atomic operations. CAMs allow us to move the lock on chip, increase the degree of multithreading, and achieve better performance.
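As a rough illustration of the idea, the following Python sketch models a CAM acting as an on-chip lock and merge buffer for a Group-By sum. It is a toy software simulation under stated assumptions, not the datapath described in the thesis: the names `SyncCAM` and `group_by_sum`, the fixed memory `latency`, and the one-row-per-cycle issue model are all hypothetical. A row whose key already has a request in flight is merged into the CAM entry instead of taking an in-memory lock or issuing a second memory request.

```python
from collections import deque


class SyncCAM:
    """Toy model of a CAM keyed by group key (hypothetical, for illustration)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = {}  # key -> partial sum accumulated while a request is in flight

    def try_acquire(self, key, value):
        """Return 'merged', 'acquired', or 'full'."""
        if key in self.entries:
            # Key is already "locked" on chip: fold the value into the entry.
            self.entries[key] += value
            return "merged"
        if len(self.entries) >= self.capacity:
            return "full"
        self.entries[key] = value
        return "acquired"

    def release(self, key):
        """Free the entry and return the partial sum it accumulated."""
        return self.entries.pop(key)


def group_by_sum(rows, cam_capacity=4, latency=3):
    """Simulate a Group-By sum; assumes cam_capacity >= 1 and a fixed latency."""
    memory = {}          # stands in for off-chip memory holding the aggregates
    cam = SyncCAM(cam_capacity)
    in_flight = deque()  # (ready_cycle, key), in issue order
    pending = deque(rows)
    cycle = 0
    mem_requests = 0
    while pending or in_flight:
        # Retire memory requests whose latency has elapsed.
        while in_flight and in_flight[0][0] <= cycle:
            _, key = in_flight.popleft()
            memory[key] = memory.get(key, 0) + cam.release(key)
        # Issue at most one row per cycle; stall if the CAM is full.
        if pending:
            key, value = pending[0]
            status = cam.try_acquire(key, value)
            if status != "full":
                pending.popleft()
                if status == "acquired":
                    in_flight.append((cycle + latency, key))
                    mem_requests += 1
        cycle += 1
    return memory, mem_requests


rows = [("a", 1), ("b", 2), ("a", 3), ("c", 4), ("a", 5)]
totals, requests = group_by_sum(rows)
print(totals, requests)
```

In this model, rows that hit an in-flight key never reach memory, so the number of memory requests can be smaller than the number of rows, and no atomic read-modify-write on a memory-resident lock is needed.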