Resilient On-Chip Memory Design in the Nano Era
- Author(s): Banaiyanmofrad, Abbas
- Advisor(s): Dutt, Nikil
- et al.
Aggressive technology scaling in the nano-scale regime makes chips more susceptible to failures. This causes multiple reliability challenges in the design of modern chips, including manufacturing defects, wear-out, and parametric variations. By increasing the number, amount, and hierarchy of on-chip memory blocks in emerging computing systems, the reliability of the memory sub-system becomes an increasingly challenging design issue. The limitations of existing resilient memory design schemes motivate us to think about new approaches considering scalability, interconnect-awareness, and cost-effectiveness as major design factors. In this thesis, we propose different approaches to address resilient on-chip memory design in computing systems ranging from traditional single-core processors to emerging many-core platforms. We classify our proposed approaches in five main categories: 1) Flexible and low-cost approaches to protect cache memories in single-core processors against permanent faults and transient errors, 2) Scalable fault-tolerant approaches to protect last-level caches with non-uniform cache access in chip multiprocessors, 3) Interconnect-aware cache protection schemes in network-on-chip architectures, 4) Relaxing memory resiliency for approximate computing applications, and 5) System-level design space exploration, analysis, and optimization for redundancy-aware on-chip memory resiliency in many-core platforms. We first propose a flexible fault-tolerant cache (FFT-Cache) architecture for SRAM-based on-chip cache memories in single-core processors working at near-threshold voltages. Then, we extend the technique proposed in FFT-Cache, to protect shared last-level cache (LLC) with Non-Uniform Cache Access (NUCA) in chip multiprocessor (CMP) architectures, proposing REMEDIATE that leverages a flexible fault remapping technique while considering the implications of different remapping heuristics in the presence of cache banking, non-uniform latency, and interconnected network. Then, we extend REMEDIATE by introducing RESCUE with the main goal of proposing a design trend (aggressive voltage scaling + cache over-provisioning) that uses different fault remapping heuristics with salable implementation for shared multi-bank LLC in CMPs to reduce power while exploring a large design space with multiple dimensions and performing multiple sensitivity analysis. Considering multibit upsets, we propose a low-cost technique to leverage embedded erasure coding (EEC) to tackle soft errors as well as hard errors in data caches of a high-performance as well as an embedded processor. Considering non-trivial effect of interconnection fabric in memory resiliency of network-on-chip (NoC) platforms, we then propose a novel fault-tolerant scheme that leverages the interconnection network to protect the LLC cache banks against permanent faults. During a LLC access to a faulty area, the network detects and corrects the faults, returning the fault-free data to the requesting core. In another approach, we propose CoDEC, a Co-design approach to error coding of cache and interconnect in many-core architectures to reduce the cost of error protection compared to conventional methods. Proposing a system-wide error coding scheme, CoDEC guarantees end-to-end protection of LLC data blocks throughout the on-chip network against errors. Observing available tradeoffs among reliability, output fidelity, performance, and energy in emerging error-resilient applications in approximate computing era motivates us to consider application-awareness in resilient memory design. The key idea is exploiting the intrinsic tolerance of such applications to some level of errors for relaxing memory guard-banding to reduce design overheads. As an exemplar we propose Relaxed-Cache, in which we relax the definition of faulty block depending on the number and location of faulty bits in a SRAM-based cache to save energy. In this part of thesis, we aim at cross-layer characterization and optimization of on-chip memory resiliency over the system stack. Our first contribution toward this approach is focusing more on scalability of memory resiliency as a system-level design methodology for scalable fault-tolerance of distributed on-chip memories in NoCs. We introduce a novel reliability clustering model for effective shared redundancy management toward cost-efficient fault-tolerance of on-chip memory blocks. Each cluster represents a group of cores that have access to shared redundancy resources for protection of their memory blocks.