The general-purpose graphics processing unit (GPGPU) is one of the most popular many-core accelerators
that deliver massive computing power in parallel applications. GPGPUs mainly
rely on hardware multithreading to hide short pipeline stalls and long memory latencies.
Thus, the performance of a GPGPU can be significantly affected by how its
hardware multithreading is applied. However, finding the optimal way to apply hardware multithreading
is a complex problem since there are many aspects to be considered. This work studies the
mechanisms for improving the effectiveness of hardware multithreading. First, it studies
the various scheduling policies and proposes an adaptive scheduling policy that chooses the
best scheduling policy at runtime. In addition, it proposes a simple but effective warp throttling
mechanism that can increase cache locality. Furthermore, it proposes a hardware
prefetching mechanism to extend the memory latency hiding capability of hardware multithreading.
Finally, it shows how the limited scalability of the conventional cache miss handling architecture
constrains the degree of hardware multithreading and proposes a highly scalable
cache miss handling architecture.