The massive parallelism provided by general-purpose GPUs (GPGPUs) possessing numerous compute threads in their streaming multiprocessors (SMs) and enormous memory bandwidths have made them the de-facto accelerator of choice in many scientific domains. To support the complex memory access patterns of applications, GPGPUs have a multi-level memory hierarchy consisting of a huge register file and an L1 data cache private to each SM, a banked shared L2 cache connected through an interconnection network across all SMs and high-bandwidth banked DRAM. With the amount of parallelism GPUs can provide, memory traffic becomes a major bottleneck, mostly due to the small amount of private cache that can be allocated for each thread, and the constant demand of data from the GPU’s many computation cores. This results in under-utilization of many SM components like register file, thereby incurring sizable overhead in the GPU power consumption due to wasted static energy of the registers. The aim of this dissertation is to develop techniques that can boost the performance in spite of small caches and improve power management techniques to boost energy saving.
In our first technique, we present \textit{PAVER}, a priority-aware vertex scheduler, which takes a graph-theoretic approach towards thread-block (TB) scheduling. We analyze the cache locality behavior among TBs and represent the problem using a graph representing the TBs and the locality among them. This graph will then be partitioned to TB groups that display maximum data sharing and assigned to the same SM by the locality-aware TB scheduler. This novel technique also reduces the leakage and dynamic access power of the L2 caches, while improving the overall performance of the GPU.
In our second study, \textit{Locality Guru}, we seek to employ the JIT analysis to find the data-locality between structures at various granularity such as threads, warps and TBs in a GPU Kernel using the load register's address tracing through a syntax tree. This information can help make smarter decisions for a locality aware data-partition and scheduling in single and multi-GPUs.
In the previous techniques, we gained performance benefit by exploiting the data-locality in the GPUs, which eventually translates to static energy saving in the whole GPU. Next, we analyze the static energy saving of the storage structures like L1 and L2 caches by directly applying power management techniques to save power during the time they are idle.
Finally, we develop, \textit{Slumber}, a realistic model for determining the wake-up time of registers from various under-volting and power gating modes. We propose a hybrid energy saving technique where a combination of power-gating and under-volting can be used to save optimum energy in the register file depending on the idle period of the registers with a negligible