Efficient thermal management for multiprocessor systems
- Author(s): Coşkun, Ayşe Kıvılcım
- et al.
High temperatures and large thermal variations on the die create severe challenges in system reliability, performance, leakage power, and cooling costs. Designing for worst-case thermal conditions is costly and time-consuming. Therefore, dynamic thermal management methods are needed to maintain safe temperature levels during execution. Conventional management techniques sacrifice performance to control temperature and consider only the hot spots, neglecting the effects of thermal variations. This thesis focuses on developing performance-efficient techniques to achieve safe and balanced thermal profiles on multiprocessor systems-on-chip (MPSoCs). Modeling the performance, temperature, and reliability of MPSoCs with high accuracy and reasonable simulation time is a challenge, because the model must track instruction-level activity and also simulate a sufficiently long span of real-time execution to produce meaningful reliability estimates. The first contribution of this thesis is a fast simulation framework, which evaluates the reliability of runtime policies or design-time decisions accurately in a matter of hours, whereas traditional architecture-level simulators would have to run for days. Job scheduling on an MPSoC has a significant impact on temperature and reliability. For systems with a priori known workloads, this thesis proposes a scheduling optimization method that outperforms other static energy or temperature management techniques in reducing thermal hot spots and gradients. However, an accurate design-time workload estimate is not available for most systems. This work therefore introduces dynamic techniques to address runtime variations in workload. The key aspects of these dynamic techniques are low performance impact and adaptation capability.
Reacting only after thermal events occur reduces the efficiency of thermal management policies. This thesis proposes a novel proactive management approach to address this issue, and shows that using a thermal forecast for temperature-aware scheduling achieves significant gains in both temperature and performance. All the novel management policies introduced in this thesis are evaluated using an experimental framework based on real-life systems and workloads. In the experiments on an UltraSPARC T1 processor, proactive thermal management achieves, on average, a 60% reduction in hot spot occurrences, an 80% reduction in spatial gradients, and a 75% reduction in thermal cycles in comparison to reactive thermal management, while also improving performance.
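The difference between reactive and proactive scheduling can be illustrated with a minimal sketch. It is not the thesis's implementation: the function names, the per-core temperature histories, and the least-squares linear trend used as the forecaster are all assumptions chosen for illustration, and the abstract does not say which predictor the thesis uses. The reactive policy places a job on the core that is coolest right now, while the proactive policy places it on the core predicted to be coolest a few sampling intervals ahead.

```python
from typing import Dict, Sequence


def forecast_temperature(history: Sequence[float], steps_ahead: int = 1) -> float:
    """Extrapolate a least-squares linear trend through recent samples.

    Assumption: a linear trend is only a stand-in forecaster for this sketch.
    """
    n = len(history)
    if n == 0:
        raise ValueError("history must not be empty")
    if n == 1:
        return float(history[0])
    mean_x = (n - 1) / 2
    mean_y = sum(history) / n
    sxx = sum((i - mean_x) ** 2 for i in range(n))
    sxy = sum((i - mean_x) * (y - mean_y) for i, y in enumerate(history))
    slope = sxy / sxx
    # Evaluate the fitted line steps_ahead samples past the last observation.
    return mean_y + slope * ((n - 1 + steps_ahead) - mean_x)


def pick_core_reactive(histories: Dict[str, Sequence[float]]) -> str:
    """Choose the core with the lowest current (most recent) temperature."""
    return min(histories, key=lambda core: histories[core][-1])


def pick_core_proactive(histories: Dict[str, Sequence[float]],
                        steps_ahead: int = 3) -> str:
    """Choose the core with the lowest forecast temperature."""
    return min(histories,
               key=lambda core: forecast_temperature(histories[core], steps_ahead))


# Hypothetical temperature samples in degrees Celsius, oldest first.
histories = {
    "core0": [55.0, 58.0, 61.0, 64.0],  # cooler now, but heating up
    "core1": [72.0, 70.0, 68.0, 66.0],  # warmer now, but cooling down
}
reactive_choice = pick_core_reactive(histories)
proactive_choice = pick_core_proactive(histories)
```

With these made-up samples the two policies disagree: the reactive policy selects `core0` because it is currently cooler, whereas the proactive policy selects `core1` because `core0` is trending toward a hot spot.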