System-Level Thermal Modeling and Management for Multi-Core and 3D Microprocessors
The continuous scaling of CMOS technology inevitably increases the power density of high-performance microprocessors, making thermal effects and related problems urgent and challenging. Unpredicted thermal behavior and on-chip thermal hot spots can degrade the performance of microprocessor chips and cause reliability issues. Hence, it is becoming increasingly important to develop thermal modeling methods that predict the thermal behavior of microprocessor chips, and thermal management techniques that control on-chip thermal hot spots and thermal-related reliability issues.
This research focuses on system-level thermal behavior modeling and management methods that enable the thermal-aware design of high-performance microprocessor chips and packages, and that also account for thermal-related reliability issues. First, at the chip level, a new compact lateral thermal resistance model is proposed to model the lateral thermal behavior of through-silicon vias (TSVs), which was largely ignored previously. The proposed lateral thermal model is fully compatible with the existing thermal modeling method for TSVs and can be integrated into finite difference (FD) simulation to improve accuracy. Second, for package-level thermal modeling of microprocessor chip packages, this dissertation systematically explores a top-down approach that builds thermal behavioral models from given accurate temperature and power information by means of the subspace identification (SID) method. A power-map-based approach and a piecewise linear modeling method are developed to improve the accuracy of the identified model in the presence of thermal nonlinearity and correlated power traces. Third, a more effective architecture-level distributed thermal management method is developed to balance the on-chip temperature distribution. A new temperature metric called effective initial temperature, which incorporates both the initial temperature and other transient thermal effects, is proposed to make optimized task migration decisions in the new distributed thermal management method, leading to a more effective reduction of thermal hot spots. Last but not least, since temperature has an exponential influence on chip reliability, this research also proposes a new system-level reliability model derived from fundamental physics principles, in which reliability is modeled as lifetime resources that are consumed as the chip operates.
Based on this model, a dynamic management method is proposed to effectively balance and compensate the lifetime resources across all the cores in a multi-core processor system, reducing the chance of early failure of heavily loaded cores.
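To make the idea of lifetime resources concrete, the following is a minimal illustrative sketch and not the reliability model developed in this research. It assumes an Arrhenius-style exponential dependence of the consumption rate on temperature and a simple greedy balancing rule. The function names, the reference temperature of 330 K, the activation energy of 0.7 eV, and the per-core budget of 100 units are all hypothetical values chosen for illustration.

```python
import math

BOLTZMANN_EV = 8.617e-5  # Boltzmann constant in eV/K


def consumption_rate(temp_k, ref_temp_k=330.0, activation_ev=0.7):
    """Lifetime consumed per unit time, relative to running at ref_temp_k.

    Assumed Arrhenius-style form: the rate is 1.0 at the reference
    temperature and grows exponentially as the core gets hotter.
    """
    exponent = (activation_ev / BOLTZMANN_EV) * (1.0 / ref_temp_k - 1.0 / temp_k)
    return math.exp(exponent)


def consume(budgets, temps_k, dt):
    """Return the remaining lifetime budgets after dt time units at temps_k."""
    return [b - consumption_rate(t) * dt for b, t in zip(budgets, temps_k)]


def pick_core_for_heavy_task(budgets):
    """Greedy balancing (assumption): choose the core with the most budget left."""
    return max(range(len(budgets)), key=lambda i: budgets[i])


# Four cores start with equal (hypothetical) lifetime budgets.
budgets = [100.0, 100.0, 100.0, 100.0]
# Core 0 is heavily loaded and hot, core 3 is warm, cores 1 and 2 are cool.
temps = [360.0, 330.0, 330.0, 345.0]

budgets = consume(budgets, temps, dt=1.0)
target = pick_core_for_heavy_task(budgets)
```

In this sketch the hot core consumes its budget faster than the cooler cores, so the balancing rule steers the next heavy task away from it. A practical scheme would also need to account for how the migrated task changes the temperature of the destination core.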