Enhanced Register Data-Flow Techniques for High-Performance, Energy-Efficient GPUs
- Author(s): Asghari Esfeden, Hodjat
- Advisor(s): Abu-Ghazaleh, Nael
- et al.
To avoid excessive power consumption, the chip industry has shifted away from high-performance single-threaded designs toward high-throughput multi-threaded designs. The Graphics Processing Unit (GPU) is a prime example of such a design. GPUs have emerged as an important computational platform for data-intensive applications across many domains. They are commonly integrated into computing platforms at all scales, from mobile devices and embedded systems to high-performance enterprise-level cloud servers. GPUs use a massively multi-threaded architecture that exploits fine-grained switching between executing groups of threads to hide the latency of data accesses. To support this fast context switching at scale, GPUs invest in large register files (RFs) that allow each thread to maintain its context in hardware. The RF is a critical structure in GPUs, responsible for a large portion of the area and power: the frequent accesses to the register file during kernel execution incur a sizable overhead in GPU power consumption and introduce delays, because accesses are serialized when port conflicts occur. This dissertation presents novel synergistic compiler/microarchitecture techniques for enabling high-performance and energy-efficient GPUs.
Our first technique, CORF, is a compiler-assisted Coalescing Operand Register File that performs register coalescing: it combines the reads to multiple registers required by a single instruction into a single physical read. To enable register coalescing, CORF uses register packing to co-locate narrow-width operands in the same physical register. The design uses compiler hints to identify which register pairs are commonly accessed together. This technique simultaneously reduces leakage and dynamic access power while improving the overall performance of the GPU.
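The effect of register packing on read counts can be illustrated with a small software model. This is only a sketch under stated assumptions, not CORF's actual hardware or compiler design: the 16-bit operand width, the 32-bit physical register, and every name in the code (`pack`, `unpack`, `RegisterFile`) are hypothetical choices made for the example.

```python
from typing import Dict, Tuple

# Illustrative model only: the width and layout below are assumptions.
WIDTH = 16                      # assumed width of a "narrow" operand, in bits
MASK = (1 << WIDTH) - 1


def pack(lo: int, hi: int) -> int:
    """Place two narrow operands side by side in one 32-bit word."""
    if not (0 <= lo <= MASK and 0 <= hi <= MASK):
        raise ValueError("operand does not fit in the assumed narrow width")
    return (hi << WIDTH) | lo


def unpack(word: int) -> Tuple[int, int]:
    """Recover both operands from a packed word."""
    return word & MASK, (word >> WIDTH) & MASK


class RegisterFile:
    """Toy register file that counts physical reads."""

    def __init__(self) -> None:
        self.cells: Dict[int, int] = {}
        self.reads = 0

    def write(self, index: int, value: int) -> None:
        self.cells[index] = value

    def read(self, index: int) -> int:
        self.reads += 1
        return self.cells[index]


# Unpacked: each operand occupies its own physical register, so an
# instruction that needs both operands issues two physical reads.
plain = RegisterFile()
plain.write(0, 7)
plain.write(1, 300)
a, b = plain.read(0), plain.read(1)

# Packed: both operands share physical register 0, so the same
# instruction is served by one coalesced read.
packed = RegisterFile()
packed.write(0, pack(7, 300))
c, d = unpack(packed.read(0))

print(a + b, plain.reads)    # 307 2
print(c + d, packed.reads)   # 307 1
```

In this toy model the instruction computes the same result in both layouts, but the packed layout touches the register file once instead of twice. Which operand pairs are worth packing is the decision the abstract attributes to compiler hints; the model simply assumes the pair is already known.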
The second technique, Breathing Operand Windows (BOW), exploits bypassing in GPUs. It is motivated by the observation that register accesses exhibit a high degree of temporal locality: within short instruction windows, the same registers are often accessed repeatedly. To exploit this opportunity, we propose an enhanced GPU pipeline and operand collector organization that bypasses register file accesses and instead passes values directly between instructions within the same window. To further increase bypassing opportunities, we introduce compiler optimizations that guide the write-back destination of operands depending on whether they will be reused, which also reduces write traffic. Our results show that BOW shields the register file from unnecessary accesses, which improves performance and reduces energy consumption.
In our third study, inspired by the fact that registers are the fastest and simultaneously the most expensive kind of memory available to GPU threads, we propose Register Mutual Exclusion (RegMutex). RegMutex is a software-hardware co-mechanism that enables sharing a subset of physical registers between warps during GPU kernel execution. With RegMutex, the compiler divides the architected register set into a base register set and an extended register set. While the physical registers corresponding to the base register set are statically and exclusively assigned to the warp, the hardware time-shares the remaining physical registers across warps to provision their extended register sets. GPU programs can therefore sustain approximately the same performance with fewer registers, yielding higher performance per dollar.
One of the most critical performance and design hurdles in today’s computing is operating on large volumes of data. Large data not only impedes performance by imposing long-latency memory accesses, but also makes processor design more costly by forcing designers to overprovision on-chip memory to hold the data. In our last study, we propose another novel register sharing mechanism and a warp scheduling scheme for GPUs to address these issues. Instead of modifying workloads to apply advanced algorithms or changing the GPU architecture significantly, our proposed locality-aware register file (LARF) and locality-aware scheduler (LAS) effectively reduce off-chip memory accesses and enable data sharing across warps in a timely manner. We exploit the unique data sharing patterns of big data workloads such as deep learning and matrix multiplication, and have warps opportunistically share data in the register file. In our studies, we observed many cases where the amount of parallelism was limited largely by register shortage. With LARF, register usage is also effectively reduced by having warps share one physical copy of a register.
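The register savings from having several warps share one physical copy can be illustrated with a reference-counted pool. This is a rough sketch, not LARF itself: the abstract does not say how shared data is detected, so the code assumes a hypothetical tag (for example, the identity of a matrix element) supplied by the caller, and the class and method names are invented for the example.

```python
from typing import Dict, Hashable, List


class SharedRegisterPool:
    """Reference-counted physical registers, keyed by a hypothetical data tag."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self.by_tag: Dict[Hashable, int] = {}    # tag -> physical register index
        self.refcount: Dict[int, int] = {}       # physical index -> warps using it
        self.free: List[int] = list(range(capacity))

    def acquire(self, tag: Hashable) -> int:
        """Return the register holding `tag`, allocating only on first use."""
        if tag in self.by_tag:
            phys = self.by_tag[tag]
        else:
            if not self.free:
                raise RuntimeError("out of physical registers")
            phys = self.free.pop(0)
            self.by_tag[tag] = phys
            self.refcount[phys] = 0
        self.refcount[phys] += 1
        return phys

    def release(self, tag: Hashable) -> None:
        """Drop one user; free the register when the last user is gone."""
        phys = self.by_tag[tag]
        self.refcount[phys] -= 1
        if self.refcount[phys] == 0:
            del self.by_tag[tag]
            del self.refcount[phys]
            self.free.append(phys)

    def in_use(self) -> int:
        return self.capacity - len(self.free)


pool = SharedRegisterPool(capacity=8)
# Four warps each need the same shared value plus one private value.
for warp in range(4):
    pool.acquire("B[0]")               # shared across all four warps
    pool.acquire(("private", warp))    # unique to this warp
print(pool.in_use())   # 5 (one private copy of "B[0]" per warp would need 8)
```

In this model the four warps occupy five physical registers instead of eight, which is the kind of reduction that can relieve the register shortage described above. A real design would also have to decide when sharing is safe and how long a shared copy stays resident, which the sketch does not address.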