Design and architecture of automatically-generated energy-reducing coprocessors
- Author(s): Sampson, John Morgan
- et al.
For many years, improvements to CMOS process technologies fueled rapid growth in processor performance and throughput. Each process generation brought exponentially more transistors and exponentially reduced the per-transistor switching power. However, concerns over leakage currents have moved us out of the classical CMOS scaling regime. Although the number of available transistors continues to rise, their switching power no longer declines. In contrast to transistor counts, power budgets remain fixed due to limitations on cooling or battery life. Thus, with each new process generation, an exponentially decreasing fraction of the available transistors can be switched simultaneously. This growing divide between available transistors and utilizable transistors is the utilization wall. This dissertation characterizes the utilization wall and proposes conservation cores as a means of surmounting its most pressing challenges. Conservation cores, or C-Cores, are application-specific hardware circuits created to reduce energy consumption on computationally intensive applications with complex control logic and irregular memory access patterns. C-Cores are drop-in replacements for existing source code, and make use of limited reconfigurability to adapt to software changes over time. The design and implementation of these specialized execution engines pose challenges with respect to code selection, automatic synthesis, choice of programming model, longevity/robustness, and system integration. This dissertation addresses many of these challenges through the development of an automated conservation core toolchain. The toolchain automatically extracts the key kernels from a target workload and uses a custom C-to-silicon infrastructure to generate 45 nm implementations of the C-Cores. C-Cores employ a new pipeline design technique called pipeline splitting, or pipesplitting. This technique reduces clock power, increases memory parallelism, and further exploits operation-level parallelism. C-Cores also incorporate specialized energy-efficient per-instruction data caches called cachelets into the datapath, which allow for sub-cycle cache-coherent memory accesses. An evaluation of C-Cores against an efficient in-order processor shows that C-Cores speed up the code they target by 1.5x, improve the energy-delay product (EDP) of that code by 6.9x, and accelerate the whole application by 1.33x on average, while reducing application energy-delay product by 57%.
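The utilization wall argument in the abstract can be illustrated with a little arithmetic. The sketch below is not taken from the dissertation: it assumes, purely for illustration, that the transistor count grows by a constant factor each process generation (the hypothetical parameter `transistor_growth`, defaulting to 2.0), that per-transistor switching power stays constant, that the power budget is fixed, and that generation 0 can switch all of its transistors at once. The function name `utilizable_fraction` is likewise an illustrative choice.

```python
def utilizable_fraction(generations: int, transistor_growth: float = 2.0) -> float:
    """Fraction of transistors that can switch at once under a fixed power budget.

    Assumptions (illustrative, not from the dissertation):
      - generation 0 can switch 100% of its transistors within the budget;
      - each generation multiplies the transistor count by `transistor_growth`;
      - per-transistor switching power no longer declines;
      - the total power budget does not change.
    """
    if generations < 0:
        raise ValueError("generations must be non-negative")
    if transistor_growth <= 0:
        raise ValueError("transistor_growth must be positive")
    # The budget pays for a fixed number of switching transistors, while the
    # total count grows geometrically, so the usable fraction shrinks geometrically.
    return 1.0 / (transistor_growth ** generations)


if __name__ == "__main__":
    for gen in range(5):
        fraction = utilizable_fraction(gen)
        print(f"generation {gen}: {fraction:.1%} of transistors utilizable")
```

Under these assumptions the utilizable fraction falls geometrically with each generation, which matches the "exponentially decreasing fraction" the abstract describes. The actual growth factor and any additional effects, such as changes in clock frequency, would alter the specific numbers.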