Understanding Software Application Behaviour in Presence of Permanent and Intermittent Hardware Faults
- Author(s): Sharma, Ankur
- Advisor(s): Gupta, Puneet
- et al.
Over the past three decades, technological advancement in the fabrication of VLSI ICs has been accompanied by shrinking device sizes and scaling of the supply voltage. While power, area and performance have steadily improved, hardware reliability has become a growing concern. Due to increased process, voltage and temperature (PVT) variations, the infant mortality rate has gone up. Coupled with PVT variations, aging- and wearout-induced failures have exacerbated the problem, as devices fail unexpectedly while in operation. Although a significant fraction of emerging failure and wearout mechanisms result in intermittent or permanent faults in the hardware, their impact (as distinct from that of transient faults) on software applications has not been well studied. In this work, we analyze the impact of such failures on software applications and, from a basic circuit-level understanding of the failure mechanisms, develop a distinguishing application characteristic referred to as similarity. We present a mathematical definition of similarity, along with approximations for computing it for practical software applications, and experimentally verify the relationship between similarity and fault rate. Leveraging the dependence of application robustness on the similarity metric, we present example architecture-independent code transformations that reduce similarity, and thereby the worst-case fault rate, with minimal performance degradation. Experiments with arithmetic unit faults show as much as 74% improvement in the worst-case fault rate on benchmark kernels with less than 10% performance degradation.