Thermal and Power Estimation and Reliability Management for Commercial Multi-Core Processors
- Zhang, Jinwei
- Advisor(s): Tan, Sheldon
Abstract
Power, thermal, and related reliability issues are among the major limiting factors for today’s high-performance multi-core processors. This is especially true since the breakdown of Dennard scaling, after which power density has increased as IC technology advances. To enhance reliability, researchers have proposed many power/thermal regulation and dynamic management methods, including clock gating, power gating, dynamic voltage and frequency scaling (DVFS), and task migration. In this thesis, we present our findings on the challenges of post-silicon power and thermal characterization and of dynamic thermal management for lifetime reliability. First, we address the problem of accurate full-chip power and thermal map estimation for commercial off-the-shelf multi-core processors. A novel scheme is developed to generate true 2D power density maps from thermal measurements of the processor, taken under backside cooling with an advanced infrared (IR) thermal imaging system. The proposed method achieves both higher resolution and a considerable speedup compared with a recently proposed state-of-the-art method. Second, we propose, for the first time, a novel machine-learning-based approach for real-time estimation of chip-level spatial power maps for commercial TPU chips. Specifically, we estimate the spatial power of commercial TPUs in real time from the hyperparameters of the neural networks (workloads) deployed on them. Third, thermal map estimation for processors operating under heat sink cooling remains a challenging problem because of the difficulty of direct measurement. We build a finite element method (FEM) model to reconstruct the full-chip thermal maps of commercial processors while they are under heat sinks.
Lastly, based on the spatial power characterization, we propose a new dynamic thermal and reliability management framework that uses task mapping and migration to improve the thermal performance and lifetime reliability of commercial multi-core processors. Compared with existing work, the new approach is the first to optimize VLSI reliability by exploring workload-dependent power hot spots. The advantages of the proposed method over the Linux baseline task mapping and the temperature-based mapping method are demonstrated and validated on real commercial processors.