With technology scaling, circuit performance has become more sensitive to various sources of variability, including manufacturing variations, ambient fluctuations, and circuit wear-out. These increased variations create new challenges for conventional hardware guardbanding, because the additional design margin they require diminishes the benefits of technology scaling. This dissertation aims to reduce total system design margin through cross-layer approaches to monitoring, margining, and mitigation of circuit variability.
Hardware and software adaptation can reduce design margin when hardware monitors expose the underlying hardware variability. We therefore start by proposing two different types of performance monitors that achieve better monitoring accuracy with smaller monitoring overhead. We also demonstrate the use of these performance monitors in system adaptation through our end-to-end implementation of software testbeds.
Next, we study the margining problem for dynamic variations and reliability in the presence of monitor-and-actuate adaptation and emerging system contexts. In a system with monitor-and-actuate adaptation, dynamic variations require extra margin to cover the monitor and actuator latencies. We analyze this margining problem for different choices of monitor and actuator types. We also propose system reliability margining strategies for circuits in the “dark silicon” era, where low-level design margin should account for high-level power and thermal constraints.
Last, we propose a clock gating methodology to mitigate aging-induced clock skew, which is difficult to monitor and resolve through adaptation. For certain phenomena and variation sources, such as soft error rates at different locations and altitudes, we also propose system/cloud-based monitors. In addition, we build an emulation platform to study the impact of dynamic power management schemes on system reliability.