eScholarship
Open Access Publications from the University of California

UCLA

UCLA Electronic Theses and Dissertations

Lightweight Fault Tolerance in SRAM Based On-Chip Memories

Abstract

The reliability of the memory subsystem is fast becoming a concern in computer architecture and system design. From on-chip embedded memories in Internet-of-Things (IoT) devices and on-chip caches to off-chip main memories, memories have become the limiting factor in the reliability of computing systems. This is because they are designed primarily to maximize bit storage density, which makes them particularly sensitive to manufacturing process variation, environmental operating conditions, and aging-induced wearout. Addressing these concerns is especially challenging in on-chip caches and in embedded memories such as scratchpads in IoT devices, because the additional area, power, and latency overheads of reliability techniques must be kept to a minimum. Hence, this dissertation proposes lightweight fault tolerance for SRAM-based scratchpad memories and last-level caches.

In the first part of the dissertation, we propose FaultLink, an approach for dealing with known hard faults in software-managed scratchpad memories. FaultLink avoids hard faults found during testing by generating a custom-tailored application binary image for each individual chip.
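To make the idea of a per-chip image concrete, the following is a minimal sketch of placing program sections into the fault-free regions of a memory whose faulty addresses are known from testing. It is an illustration only, not FaultLink itself: it uses a greedy first-fit-decreasing heuristic rather than an optimal packing, and the function names, section names, sizes, and fault addresses are all hypothetical.

```python
from typing import Dict, List, Tuple


def fault_free_segments(mem_size: int, faulty: List[int]) -> List[Tuple[int, int]]:
    """Return (start, length) runs of addresses that contain no known hard fault."""
    segments = []
    start = 0
    # The sentinel mem_size closes the final run after the last faulty address.
    for addr in sorted(set(faulty)) + [mem_size]:
        if addr > start:
            segments.append((start, addr - start))
        start = addr + 1
    return segments


def place_sections(sections: Dict[str, int],
                   segments: List[Tuple[int, int]]) -> Dict[str, int]:
    """Greedy first-fit-decreasing placement; returns section name -> base address.

    This heuristic is an assumption made for illustration; it is not guaranteed
    to find a packing even when one exists.
    """
    free = [list(seg) for seg in segments]  # mutable [next_free_address, remaining]
    placement = {}
    for name, size in sorted(sections.items(), key=lambda kv: -kv[1]):
        for seg in free:
            if seg[1] >= size:
                placement[name] = seg[0]
                seg[0] += size
                seg[1] -= size
                break
        else:
            raise MemoryError(f"no fault-free segment fits section {name!r} ({size} units)")
    return placement


if __name__ == "__main__":
    # Hypothetical 64-unit scratchpad with hard faults at addresses 10 and 40.
    segs = fault_free_segments(64, [10, 40])
    layout = place_sections({"main": 20, "isr": 8, "buf": 16, "tbl": 9}, segs)
    for name, base in sorted(layout.items(), key=lambda kv: kv[1]):
        print(f"{name}: base address {base}")
```

A real flow would emit the resulting base addresses into a linker script; that step is omitted here because the script format depends on the toolchain.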

At software deployment time, FaultLink optimally packs small sections of program code and data into fault-free segments of the memory address space and generates a custom linker script for a lazy-linking procedure.

The second part proposes two software-defined lightweight error detection and correction techniques for recovering from soft errors at run time: Software Defined Error Localization Code (SDELC) and Parity++. SDELC primarily targets embedded memories and uses novel, inexpensive Ultra-Lightweight Error-Localizing Codes (UL-ELCs). UL-ELCs require fewer parity bits than single-error-correcting Hamming codes, yet they are more powerful than basic single-error-detecting parity: they localize single-bit errors to a specific chunk of a codeword. SDELC then heuristically recovers from these localized errors using a small embedded C library that exploits observable side information (SI) about the application's memory contents.
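The localize-then-recover idea can be sketched with a simple stand-in code. The example below is not the dissertation's UL-ELC construction: it assumes a 32-bit message split into four 8-bit chunks with one even-parity bit per chunk (4 parity bits, versus the 6 that a single-error-correcting Hamming code needs for 32 message bits). It further assumes the single-bit error hit a message bit rather than a parity bit. The `plausible` predicate is a hypothetical placeholder for side information.

```python
from typing import Callable, List, Optional

CHUNK_BITS = 8   # assumed chunk width
NUM_CHUNKS = 4   # 4 chunks x 8 bits = 32-bit message


def chunk_parities(word: int) -> int:
    """Return one even-parity bit per chunk, packed into a NUM_CHUNKS-bit value."""
    bits = 0
    for i in range(NUM_CHUNKS):
        chunk = (word >> (i * CHUNK_BITS)) & ((1 << CHUNK_BITS) - 1)
        bits |= (bin(chunk).count("1") & 1) << i
    return bits


def localize(word: int, stored_parity: int) -> Optional[int]:
    """Return the index of the chunk whose parity mismatches, or None if all match."""
    syndrome = chunk_parities(word) ^ stored_parity
    if syndrome == 0:
        return None
    if syndrome & (syndrome - 1):
        raise ValueError("more than one chunk mismatches; not a single-bit error")
    return syndrome.bit_length() - 1


def candidates(word: int, chunk_index: int) -> List[int]:
    """Enumerate every message obtained by flipping one bit inside the given chunk."""
    base = chunk_index * CHUNK_BITS
    return [word ^ (1 << (base + b)) for b in range(CHUNK_BITS)]


def recover(word: int, stored_parity: int,
            plausible: Callable[[int], bool]) -> Optional[int]:
    """Return the unique plausible candidate, or None if side information is ambiguous."""
    idx = localize(word, stored_parity)
    if idx is None:
        return word
    matches = [c for c in candidates(word, idx) if plausible(c)]
    return matches[0] if len(matches) == 1 else None


if __name__ == "__main__":
    original = 0x00000123
    parity = chunk_parities(original)
    corrupted = original ^ (1 << 13)            # inject a single-bit error
    # Hypothetical side information: the value is known to be an index below 1024.
    print(hex(recover(corrupted, parity, lambda v: v < 1024)))
```

Localization shrinks the search from 32 candidate messages to 8, and the side-information predicate then decides among them. When more than one candidate looks plausible, this sketch declines to guess and returns `None`.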

Parity++ is a novel unequal message protection scheme that preferentially provides stronger error protection to certain "special messages." It provides single error detection (SED) for all messages and single error correction (SEC) for the subset of special messages. Parity++ can be used in both last-level caches and lightweight embedded memories.
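The unequal-protection property can be illustrated with a tiny hand-built code. This is a toy under stated assumptions, not the Parity++ construction: it uses 4-bit codewords, one special message, and four ordinary messages, with the message names invented for the example. The special codeword lies at Hamming distance 3 or more from every other codeword, so a single-bit error in it can be corrected. Ordinary codewords are only distance 2 apart, so a single-bit error in them can be detected but not corrected.

```python
from typing import Optional, Tuple

# Hypothetical codebook: message names and codewords are chosen for illustration.
SPECIAL = {"reset": 0b0000}
ORDINARY = {"a": 0b0111, "b": 0b1011, "c": 0b1101, "d": 0b1110}


def hamming(x: int, y: int) -> int:
    """Number of bit positions in which x and y differ."""
    return bin(x ^ y).count("1")


def decode(word: int) -> Tuple[str, Optional[str]]:
    """Classify a received 4-bit word.

    Returns ("ok", name) for an exact codeword, ("corrected", name) when the
    word is one bit away from a special codeword, and ("detected", None) for
    any other non-codeword.
    """
    for name, cw in {**SPECIAL, **ORDINARY}.items():
        if cw == word:
            return ("ok", name)
    for name, cw in SPECIAL.items():
        if hamming(cw, word) == 1:
            return ("corrected", name)
    return ("detected", None)


if __name__ == "__main__":
    print(decode(0b0100))  # special codeword with one flipped bit
    print(decode(0b0101))  # ordinary codeword 0b0111 with one flipped bit
```

The decoder tries correction only against special codewords, which is safe here because no word one bit away from a special codeword is also one bit away from an ordinary one.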
