System-Level Electromigration-Induced Dynamic Reliability Management
- Author(s): Kim, Taeyoung
- Advisor(s): Tan, Sheldon X.-D., et al.
Technology scaling has enabled ever greater processor integration, and future manycore chips will integrate even more cores. However, with the breakdown of Dennard scaling, chip power density is increasing at current and future technology nodes. As a result, power and temperature limits allow only a fraction of a manycore processor to be powered on at any given time. These trends have produced so-called dark silicon manycore processors. Additionally, reliability is becoming a limiting constraint in high-performance nanometer VLSI chip designs due to the high failure rates of deep submicron and nanoscale devices. Future chips are expected to show signs of reliability-induced aging much sooner than previous generations. Among the many reliability effects, electromigration (EM)-induced reliability has become a major design constraint due to aggressive transistor and interconnect scaling and increasing power density.
This thesis develops new system-level EM-induced dynamic reliability management techniques for several classes of systems. First, I develop system-level management for real-time embedded systems. I investigate a new lifetime optimization technique for real-time embedded processors that accounts for EM-induced reliability. The approach is based on a recently proposed physics-based EM model that gives a more accurate EM assessment of a power grid network at the chip level. Second, I develop new energy and lifetime optimization techniques for emerging dark silicon manycore microprocessors that consider both long-term reliability effects (hard errors) and transient soft errors. To optimize EM-induced lifetime, I apply an adaptive Q-learning-based method, which suits dynamic runtime operation because it provides cost-effective yet good solutions. Third, I develop new system-level dynamic reliability management (DRM) techniques for emerging low-power dark silicon manycore microprocessors operating in the near-threshold region. Here I mainly consider EM recovery effects, which were ignored in the past. To leverage these effects at the system level, I develop a new equivalent DC current model that accounts for recovery under general time-varying current waveforms, so that existing compact EM models can be applied. Fourth, I develop a new approach for cross-layer EM-induced reliability modeling and optimization at the physics, system, and data center levels. To speed up online energy optimization in a data center, I investigate a new combined compact model of data center power and reliability using a learning-based approach, in which a feed-forward neural network (FNN) is trained to predict energy and long-term reliability for each processor under data center scheduling and workloads.
Lastly, I develop long-term reliability management for GPU architectures using spatial multitasking, which allows GPU computing resources to be partitioned among multiple applications. I find that the existing reliability-agnostic thread block scheduler for spatial multitasking achieves high GPU utilization but yields poor reliability. I develop and implement a long-term reliability-aware thread block scheduler in GPGPU-Sim and compare it against the existing reliability-agnostic scheduler.
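
As a rough illustration of the Q-learning idea mentioned above for runtime lifetime optimization, the following is a minimal sketch of plain tabular Q-learning on a toy problem. It is not the adaptive method developed in the thesis. The state (a discretized core temperature), the actions (three frequency levels), the environment dynamics in `step`, and the reward that trades performance against a temperature-based stand-in for EM stress are all hypothetical assumptions chosen only to show the shape of the technique.

```python
import random

# Hypothetical toy setup: none of these names or numbers come from the thesis.
N_TEMP_LEVELS = 4      # discretized core temperature levels (the state)
ACTIONS = [0, 1, 2]    # assumed frequency levels: low, mid, high


def step(state, action, rng):
    """Toy environment: a higher frequency heats the core, a lower one cools it."""
    drift = action - 1                     # -1, 0, or +1 temperature levels
    noise = rng.choice([0, 0, 1, -1])      # small random disturbance
    next_state = min(max(state + drift + noise, 0), N_TEMP_LEVELS - 1)
    performance = action                   # reward for running faster
    stress_penalty = 1.5 * next_state      # assumed proxy for EM wear when hot
    return next_state, performance - stress_penalty


def train(episodes=500, steps=50, alpha=0.1, gamma=0.9, epsilon=0.1, seed=0):
    """Learn a Q-table with epsilon-greedy exploration."""
    rng = random.Random(seed)
    q = [[0.0 for _ in ACTIONS] for _ in range(N_TEMP_LEVELS)]
    for _ in range(episodes):
        state = rng.randrange(N_TEMP_LEVELS)
        for _ in range(steps):
            # Explore with probability epsilon, otherwise act greedily.
            if rng.random() < epsilon:
                action = rng.choice(ACTIONS)
            else:
                action = max(ACTIONS, key=lambda a: q[state][a])
            next_state, reward = step(state, action, rng)
            # Standard Q-learning temporal-difference update.
            td_target = reward + gamma * max(q[next_state])
            q[state][action] += alpha * (td_target - q[state][action])
            state = next_state
    return q


def greedy_policy(q):
    """Return the highest-valued action for each temperature level."""
    return [max(ACTIONS, key=lambda a: q[s][a]) for s in range(N_TEMP_LEVELS)]


if __name__ == "__main__":
    q_table = train()
    print(greedy_policy(q_table))
```

A table-based learner like this is cheap enough to update at runtime, which is the property the abstract points to; a realistic controller would replace the toy reward with a lifetime estimate from an EM model.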