System reliability is becoming a significant concern as process technology continues to shrink, because circuit characteristics vary increasingly in nanometer-scale microsystems. These manufacturing variations manifest as higher device fault rates, more frequent soft errors, and timing errors caused by accelerated circuit aging. This dissertation develops software-based techniques to detect, recover from, and prevent such errors in compute and memory components.
Software-based error detection and recovery techniques impose a high performance penalty on the overall system. This thesis presents methods that minimize the performance overhead of software-based error mitigation. In particular, this work proposes two techniques for effective software-based error detection in compute units: fingerprinting and cross-lane instructions. Fingerprinting combines multiple error detection events into one event by hashing, and cross-lane instructions enable error checking via low-latency register-level communication. Furthermore, to reduce the performance overhead of software-based recovery, this thesis explores Application-Specific Approximate Recovery (ASAR). ASAR trades off output quality to reduce the performance penalty of software-based recovery.
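To make the fingerprinting idea concrete, the following is a minimal sketch, not the implementation developed in this thesis. It assumes redundant execution as the underlying detection scheme and uses CRC32 as a stand-in hash; the names `fingerprint`, `run_with_fingerprint`, and `inject_fault_at` are hypothetical and introduced only for illustration. Instead of comparing every pair of results, each result stream is folded into one hash value and the two hashes are compared once.

```python
import struct
import zlib


def fingerprint(values):
    """Fold a sequence of integer results into a single CRC32 value."""
    fp = 0
    for v in values:
        # Chain the running CRC so the final value covers every result.
        fp = zlib.crc32(struct.pack("<q", v), fp)
    return fp


def run_with_fingerprint(compute, inputs, inject_fault_at=None):
    """Run `compute` twice and compare one fingerprint per stream.

    `inject_fault_at` (hypothetical test hook) flips one bit in the
    redundant stream to emulate a soft error.
    """
    inputs = list(inputs)
    primary = [compute(x) for x in inputs]
    redundant = [compute(x) for x in inputs]
    if inject_fault_at is not None:
        redundant[inject_fault_at] ^= 1  # emulate a single-bit flip
    # One comparison replaces len(inputs) per-result comparisons.
    error_detected = fingerprint(primary) != fingerprint(redundant)
    return primary, error_detected
```

In this sketch the cost of checking is one hash update per result plus a single comparison at the end. The trade-off is that a mismatch reveals only that some result in the stream differed, not which one.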
Variation affects not only compute components but memory components as well. For memory units, we focus on methods for proactive error prevention. Given the diversity of memory components in use, we focus on emerging Heterogeneous Memory Architectures (HMAs). An HMA consists of multiple memory modules with different performance and reliability characteristics. These differences can arise from different memory module types, different error-correcting codes, and aging effects. This thesis focuses on methods to place and move data items among memory modules in an HMA system with the goal of reducing the likelihood of encountering an error. Specifically, this work describes two novel data placement techniques: age-aware and vulnerability-aware data placement. The age-aware technique monitors the accumulation of faults in different memory modules as they age, while the vulnerability-aware technique estimates the vulnerability of data in memory to soft errors. The results presented in this work enable practical use of software-based error mitigation solutions for current and future hardware.
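As an illustration of reliability-driven data placement in an HMA, the sketch below greedily assigns the most vulnerable data items to the most reliable module that still has room. It is a simplified example under stated assumptions, not the placement algorithm of this thesis: the per-item vulnerability scores, the per-module fault rates, and the function name `place_data` are all hypothetical. In an age-aware setting, the fault rates would be refreshed as faults accumulate and the placement recomputed.

```python
def place_data(items, modules):
    """Greedy reliability-driven placement (illustrative only).

    items:   list of (name, size, vulnerability) tuples
    modules: list of (name, capacity, fault_rate) tuples
    Returns a dict mapping each item name to a module name.
    """
    free = {name: capacity for name, capacity, _ in modules}
    # Most reliable modules (lowest assumed fault rate) come first.
    by_reliability = sorted(modules, key=lambda m: m[2])
    placement = {}
    # Place the most vulnerable items first so they get the best modules.
    for name, size, _ in sorted(items, key=lambda it: it[2], reverse=True):
        for mod_name, _, _ in by_reliability:
            if free[mod_name] >= size:
                free[mod_name] -= size
                placement[name] = mod_name
                break
        else:
            raise ValueError(f"no module has room for {name}")
    return placement
```

A real system would also have to weigh performance, since the most reliable module is not necessarily the fastest one, and account for the cost of moving data between modules.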