UC San Diego
Architectural support for efficient on-chip parallel execution
- Author(s): Brown, Jeffery Alan
Exploitation of parallelism has for decades been central to the pursuit of computing performance. This is evident in many facets of processor design: in pipelined execution, superscalar dispatch, pipelined and banked memory subsystems, multithreading, and more recently, in the proliferation of cores within chip multiprocessors (CMPs). As designs have evolved and the parallelism dividend of each technique has been exhausted, designers have turned to other techniques in search of ever more parallelism. The recent shift to multi-core designs is a profound one, since available parallelism promises to scale further than at prior levels, limited by interconnect degree and thermal constraints. This explosion in parallelism necessitates changes in how hardware and software interact. In this dissertation, I focus on the hardware aspects of this interaction, providing support for efficient on-chip parallel execution in the face of increasing core counts.

First, I introduce a mechanism for coping with increasing memory latencies in multithreaded processors. While prior designs coped well with instruction latencies in the low tens of cycles, I show that the long latencies associated with stalls for main-memory access lead to pathological resource hoarding and performance degradation. I demonstrate a reactive solution which more than doubles throughput for two-thread workloads.

Next, I reconsider the design of coherence subsystems for CMPs. I show that a direct implementation of a traditional directory protocol on a CMP fails to take advantage of the latency and bandwidth landscape typical of CMPs. I then propose a CMP-specific customization of directory-based coherence, and use it to demonstrate overall speedup, reduced miss latency, and decreased interconnect utilization.

I then focus on improving hardware support for multithreading itself, specifically for thread scheduling, creation, and migration. I approach this from two complementary directions.
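The resource-hoarding pathology and the benefit of a reactive response can be illustrated with a toy simulation. This is a minimal sketch under assumed parameters (queue size, miss latency, and the flush trigger are all illustrative, not the dissertation's actual design): two threads share an issue queue; when one stalls on a main-memory miss it keeps filling the queue with instructions that cannot issue, starving the other thread, and a reactive policy reclaims the stalled thread's entries after a short grace period.

```python
"""Toy model of SMT resource hoarding on long memory stalls, and a
reactive flush policy. All sizes/latencies are illustrative assumptions."""

QUEUE_SIZE = 32       # shared issue-queue entries (assumed)
MISS_LATENCY = 300    # cycles for a main-memory miss (assumed)
STALL_TRIGGER = 20    # cycles stalled before the reactive flush fires (assumed)

def simulate(cycles, reactive_flush):
    occupied = {0: 0, 1: 0}    # issue-queue entries held by each thread
    stall_left = {0: 0, 1: 0}  # remaining miss latency per thread
    stalled_for = {0: 0, 1: 0}
    committed = 0
    for cycle in range(cycles):
        for t in (0, 1):
            if stall_left[t] > 0:
                # Thread t is waiting on memory but keeps fetching dependent
                # instructions into the shared queue: resource hoarding.
                stall_left[t] -= 1
                stalled_for[t] += 1
                free = QUEUE_SIZE - occupied[0] - occupied[1]
                occupied[t] += min(2, free)
                if reactive_flush and stalled_for[t] >= STALL_TRIGGER:
                    occupied[t] = 1   # reclaim entries; keep only the load
            else:
                stalled_for[t] = 0
                free = QUEUE_SIZE - occupied[0] - occupied[1]
                occupied[t] += min(2, free)   # fetch up to 2 instructions
                issued = min(2, occupied[t])  # issue/commit up to 2
                occupied[t] -= issued
                committed += issued
                if t == 0 and cycle % 50 == 0:  # thread 0 misses periodically
                    stall_left[0] = MISS_LATENCY
    return committed
```

Running `simulate(2000, False)` versus `simulate(2000, True)` shows the starved second thread recovering most of its throughput once the hoarder's entries are reclaimed.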
First, I augment a CMP with support for rapidly transferring register state between execution pipelines and off-core thread storage. I demonstrate performance improvements from accelerated inter-core threading, both by scheduling around long-latency stalls as they occur, and by running a conventional multi-thread scheduler at higher sample rates than would be possible with software alone. Second, I consider a key bottleneck for newly-forked and newly-rescheduled threads: the lack of useful cached working sets, and the inability of conventional hardware to quickly construct those sets. I propose a solution which uses small hardware tables to monitor the behavior of executing threads, prepare working-set summaries on demand, and then use those summaries to rapidly prefetch working sets when threads are forked or migrated. These techniques as much as double the performance of newly-migrated threads.
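The working-set summary idea can be sketched in software as follows. This is an illustrative model, not the dissertation's hardware design: the region granularity, table capacity, and FIFO eviction policy are all assumptions. A small table records which blocks of each memory region a running thread touches; on migration, the compact summary is expanded into a prefetch list at the destination core.

```python
"""Sketch of working-set summarization for thread migration.
Granularities and the eviction policy are illustrative assumptions."""

from collections import OrderedDict

BLOCK = 64           # cache-block size in bytes (assumed)
REGION = 4096        # summary granularity: one bitmap per 4 KiB region
TABLE_CAPACITY = 16  # regions tracked; kept small to be hardware-friendly

class WorkingSetTable:
    def __init__(self):
        # region base address -> bitmap of touched blocks in that region
        self.regions = OrderedDict()

    def observe(self, addr):
        """Called on each memory access by the monitored thread."""
        base = addr - (addr % REGION)
        if base not in self.regions:
            if len(self.regions) >= TABLE_CAPACITY:
                self.regions.popitem(last=False)  # evict oldest region (FIFO)
            self.regions[base] = 0
        self.regions[base] |= 1 << ((addr % REGION) // BLOCK)

    def summary(self):
        """Compact working-set summary: (region base, block bitmap) pairs."""
        return list(self.regions.items())

def prefetch_list(summary):
    """Expand a summary into block addresses to prefetch on the target core."""
    addrs = []
    for base, bitmap in summary:
        for i in range(REGION // BLOCK):
            if bitmap >> i & 1:
                addrs.append(base + i * BLOCK)
    return addrs
```

For example, after observing accesses to `0x1000`, `0x1040`, and `0x2000`, the summary occupies two table entries yet reconstructs all three block addresses for prefetching.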