To avoid excessive power consumption, the chip industry has shifted away from high-performance single-threaded designs toward high-throughput multi-threaded designs. Graphics Processing Units (GPUs) are a prime example of such designs.
GPUs have emerged as an important computational platform for data-intensive applications across a wide range of application domains. They are commonly integrated into computing platforms at all scales, from mobile devices and embedded systems to high-performance enterprise-level cloud servers.
GPUs use a massively multi-threaded architecture that exploits fine-grained switching between executing groups of threads to hide the latency of data accesses. To support this fast context switching at scale, GPUs invest in large Register Files (RF) that allow each thread to maintain its context in hardware. The RF is a critical structure in GPUs, responsible for a large portion of their area and power; frequent accesses to the register file during kernel execution incur a sizable overhead in GPU power consumption and introduce delays as accesses are serialized when port conflicts occur. This dissertation presents novel synergistic compiler/microarchitecture techniques for enabling high-performance and energy-efficient GPUs.
Our first technique, CORF, is a compiler-assisted Coalescing Operand Register File that performs register coalescing by combining reads to multiple registers required by a single instruction into a single physical read. To enable register coalescing, CORF utilizes register packing to co-locate narrow-width operands in the same physical register. Our proposed design uses compiler hints to identify which register pairs are commonly accessed together. This technique simultaneously reduces leakage and dynamic access power while improving the overall performance of the GPU.
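To make the coalescing idea concrete, the following minimal Python sketch models a packed register file. The class name, the 16-bit packing granularity, and the 32-lane layout are illustrative assumptions for this sketch, not CORF's actual microarchitecture:

```python
# Illustrative sketch (not CORF's real design): one physical register holds
# two 16-bit "narrow" values per lane, so two architectural operands that are
# packed together cost a single physical read.

class PackedRegisterFile:
    def __init__(self):
        self.phys = {}          # phys_reg -> list of 32-bit lane values
        self.rename = {}        # arch_reg -> (phys_reg, half), half in {"lo","hi"}
        self.physical_reads = 0

    def pack(self, arch_lo, arch_hi, phys_reg, lanes_lo, lanes_hi):
        """Co-locate two narrow (16-bit) registers in one physical register."""
        self.phys[phys_reg] = [(hi << 16) | (lo & 0xFFFF)
                               for lo, hi in zip(lanes_lo, lanes_hi)]
        self.rename[arch_lo] = (phys_reg, "lo")
        self.rename[arch_hi] = (phys_reg, "hi")

    def read_operands(self, arch_regs):
        """Read all source operands of one instruction; operands living in
        the same physical register are coalesced into one physical access."""
        values = {}
        for phys_reg in {self.rename[r][0] for r in arch_regs}:
            self.physical_reads += 1      # one access per physical register
            word = self.phys[phys_reg]
            for r in arch_regs:
                p, half = self.rename[r]
                if p == phys_reg:
                    values[r] = ([w & 0xFFFF for w in word] if half == "lo"
                                 else [w >> 16 for w in word])
        return values

rf = PackedRegisterFile()
rf.pack("r4", "r5", phys_reg=0,
        lanes_lo=list(range(32)), lanes_hi=list(range(32, 64)))
ops = rf.read_operands(["r4", "r5"])   # e.g. add r6, r4, r5
print(rf.physical_reads)               # 1 physical read instead of 2
```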
The second technique, Breathing Operand Windows to exploit bypassing in GPUs (BOW), is motivated by the observation that there is a high degree of temporal locality in register accesses: within short instruction windows, the same registers are often accessed repeatedly. To exploit this opportunity, we propose an enhanced GPU pipeline and operand collector organization that supports bypassing register file accesses, instead passing values directly between instructions within the same window. To further increase bypassing opportunities, we introduce compiler optimizations that guide the write-back destination of operands depending on whether they will be reused, further reducing write traffic. Our results show that BOW shields the register file from unnecessary accesses, which improves performance and reduces energy consumption.
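The locality BOW relies on can be illustrated with a short sketch that scans an instruction trace and counts how many source-operand reads could be served by forwarding within a window. The trace format and window size below are our illustrative assumptions, not BOW's actual pipeline:

```python
# Illustrative sketch of the bypassing opportunity BOW exploits: within a
# short instruction window, a source operand produced by a recent instruction
# can be forwarded directly instead of being read from the register file.

def count_bypassed_reads(trace, window=4):
    """trace: list of (dest_regs, src_regs) per instruction, in order."""
    bypassed = total = 0
    recent = []                           # dests of the last `window` instrs
    for dests, srcs in trace:
        live = {r for ds in recent for r in ds}
        for s in srcs:
            total += 1
            if s in live:                 # value still in the operand window
                bypassed += 1             # forwarded, RF read avoided
        recent.append(set(dests))
        if len(recent) > window:
            recent.pop(0)
    return bypassed, total

# A short dependent chain: r2 = r0+r1; r3 = r2*r0; r4 = r3+r2
trace = [({"r2"}, ["r0", "r1"]),
         ({"r3"}, ["r2", "r0"]),
         ({"r4"}, ["r3", "r2"])]
by, tot = count_bypassed_reads(trace)
print(f"{by}/{tot} register reads bypassed")   # 3/6 in this toy trace
```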
In our third study, inspired by the fact that registers are the fastest and simultaneously the most expensive kind of memory available to GPU threads, we propose Register Mutual Exclusion (RegMutex), a software-hardware co-mechanism that enables sharing a subset of physical registers between warps during GPU kernel execution. With RegMutex, the compiler divides the architected register set into a base register set and an extended register set. While the physical registers corresponding to the base register set are statically and exclusively assigned to each warp, the hardware time-shares the remaining physical registers across warps to provision their extended register sets. Therefore, GPU programs can sustain approximately the same performance with a smaller number of registers, yielding higher performance per dollar.
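The base/extended split can be sketched as a shared pool from which warps acquire and release extended registers. The pool size, per-warp set sizes, and the acquire/release interface below are illustrative assumptions for this sketch, not the actual hardware protocol:

```python
# Illustrative sketch of RegMutex-style provisioning: each warp owns a small
# base register set; a shared pool of extended registers is time-shared
# across warps, so a warp may have to wait for another warp to release.

class ExtendedRegisterPool:
    def __init__(self, total_phys, base_per_warp, num_warps):
        self.free = total_phys - base_per_warp * num_warps  # shared pool
        self.held = {}                                      # warp -> count

    def acquire(self, warp, count):
        """Try to grant `count` extended registers; fail if the pool is
        exhausted, in which case the warp must wait for a release."""
        if count > self.free:
            return False
        self.free -= count
        self.held[warp] = self.held.get(warp, 0) + count
        return True

    def release(self, warp):
        """Return the warp's extended registers to the shared pool, e.g.
        when it leaves the register-hungry region of the kernel."""
        self.free += self.held.pop(warp, 0)

# 2048 physical registers, 48 warps, 32 base registers each -> 512 shared
pool = ExtendedRegisterPool(total_phys=2048, base_per_warp=32, num_warps=48)
assert pool.acquire("warp0", 256)
assert pool.acquire("warp1", 256)
assert not pool.acquire("warp2", 128)   # pool exhausted: warp2 waits
pool.release("warp0")
assert pool.acquire("warp2", 128)       # succeeds after the release
```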
One of the most critical performance and design hurdles in today's computing is operating on large volumes of data. Large data not only impedes performance by imposing long-latency memory accesses, but also makes the processor design more costly by requiring the design to overprovision on-chip memory to accommodate the data. In our last study, we propose another novel register sharing mechanism and a warp scheduling scheme for GPUs to resolve these issues. Instead of modifying workloads to apply advanced algorithms or changing the GPU architecture significantly, our proposed locality-aware register file (LARF) and locality-aware scheduler (LAS) effectively reduce off-chip memory accesses and enable data sharing across warps in a timely manner. We exploit the unique data sharing patterns of big-data workloads, such as deep learning and matrix multiplication, to have warps opportunistically share data in the register file. In our studies, we have observed many cases where the amount of parallelism is limited largely by register shortage. With our proposed LARF, register usage is also effectively reduced by having warps share one physical copy of a register.
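The sharing idea can be sketched as follows: when several warps hold identical data, such as a broadcast tile in matrix multiplication, their architected registers map to one shared physical copy. The detection-by-value scheme below is an illustrative simplification, not the actual LARF/LAS mechanism:

```python
# Illustrative sketch of LARF-style register sharing: warps that load the
# same value map their architected register onto a single shared physical
# copy instead of each holding a duplicate.

class LocalityAwareRF:
    def __init__(self):
        self.phys = []          # physical register storage
        self.by_value = {}      # value -> physical register index
        self.rename = {}        # (warp, arch_reg) -> physical index

    def write(self, warp, arch_reg, value):
        idx = self.by_value.get(value)
        if idx is None:                      # first warp to produce the value
            idx = len(self.phys)
            self.phys.append(value)
            self.by_value[value] = idx
        self.rename[(warp, arch_reg)] = idx  # later warps share this copy

    def read(self, warp, arch_reg):
        return self.phys[self.rename[(warp, arch_reg)]]

rf = LocalityAwareRF()
tile = ("A", 0, 0)   # stand-in for a shared matrix tile loaded by all warps
for w in range(4):
    rf.write(w, "r8", tile)
print(len(rf.phys))          # 1 physical copy backs 4 warps' r8
print(rf.read(3, "r8"))      # every warp reads the shared value
```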