Multi-layer memory resiliency

With memories continuing to dominate the area, power, cost and performance of a design, there is a critical need to provision reliable, high-performance memory bandwidth for emerging applications. Memories are susceptible to degradation and failures from a wide range of manufacturing, operational and environmental effects, requiring a multi-layer hardware/software approach that can tolerate, adapt and even opportunistically exploit such effects. The overall memory hierarchy is also highly vulnerable to the adverse effects of variability and operational stress. After reviewing the major memory degradation and failure modes, this paper describes the challenges for dependability across the memory hierarchy, and outlines research efforts to achieve multi-layer memory resilience using a hardware/software approach. Two specific exemplars are used to illustrate multi-layer memory resilience: first we describe static and dynamic policies to achieve energy savings in caches using aggressive voltage scaling combined with disabling faulty blocks; and second we show how software characteristics can be exposed to the architecture in order to mitigate the aging of large register files in GPGPUs. These approaches can further benefit from semantic retention of application intent to enhance memory dependability across multiple abstraction levels, including applications, compilers, run-time systems, and hardware platforms.


INTRODUCTION
The advent of many-core computing platforms exacerbates the classical processor-memory performance bottleneck. Traditionally, memory hierarchies have attempted to address this performance bottleneck by keeping frequently accessed data close to where they are consumed (e.g., by caching). However, contemporary design processes also need to guarantee other nonfunctional constraints such as power, energy and thermal bounds. Furthermore, since memories occupy a significant percentage of a chip's area, the memory subsystem has become vulnerable to a host of manufacturing, environmental, and operational failure/degradation mechanisms that affect the overall resiliency of the system. This paper outlines memory resilience challenges and opportunities across and between multiple levels of abstraction in a typical hardware/software design flow for computing systems (see Figure 1). The overall discussion is focused on systems-on-chip (SoCs), although similar analyses can be made for large-scale distributed systems as well. Section 2 describes memory abstractions across the design hierarchy shown in Figure 1, the typical causes of memory errors, and error manifestations at each abstraction level. Sections 3 and 4 use memory voltage scaling and wearout, respectively, as exemplars for multi-layer memory resiliency approaches. Section 5 outlines challenges for managing manufacturing variability and describes memory-related efforts within the NSF Variability Expedition project that aims to opportunistically exploit and manage hardware variability through software mechanisms. Section 6 closes with the outlook for multi-level memory resilience.

Figure 1. Memory Abstractions, Errors, and Opportunities.
This paper is part of the DAC special session on "Embedded Resiliency: Approaches for the Next Decade". Other papers in this session are: "Monitoring Reliability in Embedded Processors - A Multi-layer View" [68], "Multi-Layer Dependability: From Microarchitecture to Application Level" [69], and "Workload- and Instruction-Aware Timing Analysis - The missing Link between Technology and System-level Resilience" [70].

MEMORIES AND ERRORS
Figure 1 shows the typical hardware/software abstraction layers for computing systems. Each row of Figure 1 describes the system abstraction layer, the memory abstraction at that level, and typical manifestations of memory errors that can compromise system resiliency. The last column of Figure 1 describes opportunities for relaxed and approximate computing in the face of memory error manifestations at that level of abstraction. Memory errors manifest themselves in different ways across the abstraction stack. For instance, an unstable memory cell at the circuit/device level can cause a bit failure at the memory logic level, which in turn might propagate up the abstraction stack as a faulty memory access at the architecture level, a wrong function call or system halt at the OS level, and finally an output error or an exception at the application layer.
Figure 1 represents a symbolic abstraction of memory errors over the entire hardware/software system stack. Traditionally, memory resilience has been addressed via disparate techniques at each level of design abstraction, while newer efforts attempt to couple strategies across layers with the goal of improving system efficiency for energy, heat dissipation, lifetime, cost, etc. Furthermore, efforts in relaxed and approximate computing attempt to create designs that can trade off application quality for these system efficiency goals.
To understand memory faults, we can classify them by their temporal behavior (persistence) as well as by their causes. With respect to persistence, a memory fault can be permanent or transient. Permanent faults persist indefinitely in the system after occurrence, while transient faults manifest for a relatively short period of time after occurrence. Furthermore, causes of memory faults can be hard or soft. Hard faults are static and caused by device failure or wear-out failure. In contrast, soft faults are dynamic and are typically caused by the operating environment.
Memories suffer from different sources of unreliability that can be classified into three main groups:

• Manufacturing. Worsening manufacturing imperfections in nanoscale technologies result in increasing variability of device and circuit-level parameters. This process variation particularly affects transistor threshold voltages through random dopant fluctuation (RDF), increasing the likelihood of memory cells failing permanently due to insufficient noise margins at a given supply voltage.
• Environmental. Alpha particle radiation coming from the operating environment can cause single event upsets (SEUs). Because manufacturing effects weaken noise margins, memory cells are also becoming more susceptible to SEUs, which degrades their soft error resilience [1]. Noise stemming from variations in the supply voltage and thermal effects can also cause memory faults exhibiting dynamic and random behavior.
• Aging and Wearout. Depending on the type of technology used, memory cells can age, degrading their performance, data retention capability, and power consumption. Aging can lead to memory wearout, resulting in permanent faults.
Different memory technologies suffer from various sources of unreliability. Volatile memories such as SRAM and DRAM mostly suffer from manufacturing defects and environmental issues that lead to hard and soft errors, respectively. Endurance is not an issue in SRAM and DRAM. In contrast, different nonvolatile memories (NVMs) have their own sources of unreliability. For flash and phase change memory (PCM), wearout is the primary source of unreliability due to limited write endurance. PCMs also suffer from hard and soft errors [2]. Other emerging NVMs such as MRAM and its newer cousin STT-RAM also suffer from hard and soft errors. However, for these devices, wearout is not as great a reliability threat, because their write endurance is similar to that of SRAM.
The design of reliable computer systems has a rich history spanning several decades: variants of spatial, temporal, and information redundancy have been exploited to improve reliability. Memory systems also deploy these forms of redundancy to achieve resilience across various layers of system abstraction. Additionally, memory designers have leveraged a variety of other memory-specific techniques.
Here, we provide a sampling of common techniques used for reliable memory design at the architectural level. A significant body of research exists on the design of a reliable memory hierarchy comprising multiple levels of caches and main memory. Fault-tolerant memory designs have often used simple techniques such as adding redundant rows/columns to the memory array [18] or applying memory down-sizing techniques by disabling a faulty row or cache line (block) [20].
Information redundancy via error coding is also commonly used to improve the reliability of memory components. A wide range of error detection and correction codes (EDC and ECC, respectively) has been used [7]. Typically, EDCs are simple parity codes, while the most common ECCs use Hamming [8] or Hsiao [9] codes. ECC has proven to be an effective mechanism for handling soft errors. For NVMs that have limited write endurance, various wear-leveling approaches have been proposed to mitigate aging and extend memory lifetime.
For many embedded applications, hardware-controlled caches do not provide predictable performance and can also be energy inefficient. Consequently, caches are increasingly replaced by or augmented with software-controlled scratchpad memories (SPMs). The design of reliable SPMs has also received great attention recently, including efforts that address the reliability of SPMs for chip-multiprocessors (E-RoC [15] and SPMVisor [16]), or for hybrid memories (FTSPM [17]).
Surprisingly, very little work has attempted to leverage higher-level semantic retention [67] to assist at all levels of unreliability. Indeed, by having a "big-picture" understanding of which data structures (or parts thereof) are accessed, how frequently, and in what way during a program phase, and relating these to the fault profiles of the underlying memory subsystems, one could improve the efficiency of (or even eliminate the need for) recovery mechanisms in both hardware and software.
An exhaustive survey of memory resilience is beyond the scope of this paper. However, in the next two sections we present two recent research topics (resilient caches and memory aging) as vehicles to illustrate opportunities for multi- and cross-layer memory resilience. For each case, we briefly explain ongoing efforts and highlight an exemplar study that leverages a multi-layer approach toward improving memory resilience.

RESILIENT CACHES
We can categorize resilient SRAM cache design efforts into three main groups. Many of these have the common property of "fault-tolerant voltage-scalable" (FTVS) design, because low voltage operation, while critical for achieving power and energy savings, is the primary driver behind unreliable memories. In general, regardless of whether the fault-tolerant design is done at the cell, circuit, coding, or architecture level, there is a tradeoff in terms of memory capacity and area. This may be due to larger memory cells, spare or redundant cells, error correction logic, or a reduced amount of reliable memory available for use by the application.

Cell and Circuit-Level Techniques
The root of most SRAM reliability problems is the cell noise margin. At low supply voltages, noise margins are reduced, increasing susceptibility to data corruption caused by the environmental factors described earlier. Furthermore, variability in cell noise margins requires a statistical approach to designing a reliable memory array and to choosing the minimum supply voltage, which must be increased to maintain yield under large variations.
Engineers have designed larger memory cells using more transistors and/or larger transistors to increase mean noise margins and/or reduce margin variability, but these come at the cost of reduced area efficiency and sometimes power. Such circuit-level techniques include 8T [3][4], 10T [5], and Schmitt Trigger (ST) [6] SRAM cells.

Error Coding Techniques
Single error correction double error detection (SECDED) is a widely used coding technique for protecting memory structures against soft errors. When stronger protection is necessary, more complex multi-bit error correction schemes have also been proposed. Double error correction triple error detection (DECTED), two-dimensional ECC (2D-ECC) [10], multiple-bit segmented ECC (MS-ECC) [11], Hi-ECC [12], variable-strength ECC (VS-ECC) [13], and Memory Mapped ECC [14] are some of the more notable schemes. Besides common codes such as Hamming [8] and Hsiao [9], other strong codes such as BCH [12], OLSC [11], and Reed-Solomon [7] have also been used to gain strong error detection. However, ECC techniques generally come at high cost due to significant memory storage and logic overheads. Despite this, ECC remains a popular method for memory resilience because it is effective against soft errors and requires no involvement from other layers of abstraction.
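To make the SECDED behavior concrete, the following is a minimal sketch of an extended Hamming (8,4) code: a Hamming(7,4) word plus one overall parity bit. The 4-bit data width and the names `encode` and `decode` are illustrative assumptions; memory arrays typically protect much wider words (for instance, 64 data bits), often with Hsiao codes, and this sketch is not taken from any of the cited schemes.

```python
def encode(nibble):
    """Encode 4 data bits as an 8-bit SECDED codeword (list of 0/1 bits).

    Index 0 holds the overall parity bit; indices 1..7 hold a classic
    Hamming(7,4) word with parity bits at positions 1, 2 and 4.
    """
    d = [(nibble >> i) & 1 for i in range(4)]  # d[0] is the least significant bit
    code = [0] * 8
    code[3], code[5], code[6], code[7] = d
    code[1] = code[3] ^ code[5] ^ code[7]  # covers positions 1, 3, 5, 7
    code[2] = code[3] ^ code[6] ^ code[7]  # covers positions 2, 3, 6, 7
    code[4] = code[5] ^ code[6] ^ code[7]  # covers positions 4, 5, 6, 7
    code[0] = sum(code[1:]) % 2            # overall parity over the Hamming word
    return code


def decode(code):
    """Return (nibble, status); status is 'ok', 'corrected' or 'double_error'."""
    code = list(code)
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos                # XOR of the positions holding a 1
    overall_odd = sum(code) % 2 == 1       # True if total parity is violated
    if syndrome == 0 and not overall_odd:
        status = "ok"
    elif overall_odd:
        # Exactly one bit flipped; syndrome 0 means the overall parity bit itself.
        code[syndrome] ^= 1
        status = "corrected"
    else:
        # Nonzero syndrome with consistent overall parity: two bits flipped.
        return None, "double_error"
    nibble = code[3] | (code[5] << 1) | (code[6] << 2) | (code[7] << 3)
    return nibble, status
```

The storage overhead is visible even in this toy case: four check bits protect four data bits. The relative overhead shrinks for wider words, but the encode/decode logic sits on the access path, which is part of the cost discussed above.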

Architecture-Level Techniques
Many architecture-level schemes deploy redundancy or capacity downsizing techniques to improve the reliability of cache memories. Earlier works on fault-tolerant cache design use simple techniques such as adding redundant rows/columns to the cache [18] or disabling faulty cache blocks, sets, and/or ways [20]. Similarly, Wilkerson et al. [21] proposed multiple techniques that use part of a cache line as redundancy for defective bits in the rest of the cache lines in the same set. PADed cache [19] and Agarwal's design [1] program the column multiplexers and the address decoders, respectively, to select non-faulty blocks. Wilkerson's scheme [21] could also fall under this category.
In all the above schemes, algorithmic and compiler semantic retention could help enhance the efficiency of the proposed mechanisms, by facilitating more accurate remapping, accurate (more limited) replication, and/or more efficient relocation approaches. Some hybrid schemes combine multiple techniques mentioned earlier to minimize the costs of memory protection. Zhou [30] minimizes area overhead through joint optimization of cell size, redundancy, and ECC; and Ndai [31] performs circuit-architecture co-design for memory yield improvement.

Power/Capacity Scaling
We now turn to our most recent work [37] as an exemplar for cross-layer resilient cache design. Many works in resilient SRAM caches target power reduction by enabling low voltage operation. As described earlier in Section 2, low voltage operation results in a higher probability of faulty memory cells, thus requiring some form of fault tolerance. Thus, there is a tradeoff between power (as it depends on supply voltage) and fault tolerance overheads (in terms of area, performance, and power). Despite this, most fault-tolerant voltage-scalable (FTVS) SRAM cache designs emphasize the metric of minimum achievable VDD at fixed yield. This can be misleading when judging the efficacy of such an approach.
Thus, we proposed in [37] a better metric for evaluating FTVS SRAM caches: power versus effective capacity. For example, one can consider an ECC-based cache as either having a power overhead for a given amount of bit storage, or, for a given amount of power, having fewer bits that are usable to store data. These sorts of tradeoffs are captured appropriately by this metric, and enable more effective cross-layer design.
We realized that employing sophisticated ECC, block-level redundancy or address remapping can achieve very low supply voltages, but not the best design tradeoff in power vs. capacity. When voltage scaling an SRAM array, there is a critical point where the memory becomes virtually useless due to very high bit error rates. Fault tolerance mechanisms allow incrementally lower voltages, but at ever-increasing costs in area, power, performance, and complexity. Thus, it appears that tolerating many errors for low voltage operation can quickly become a fool's errand.
In [37] this realization led us to come up with a simple FTVS SRAM cache architecture for energy savings. The idea is to achieve a better power/capacity tradeoff for a cache by using ultra-lightweight fault tolerance that gracefully degrades cache utility as voltage is lowered. Essentially, an offline or built-in self-test (BIST) routine identifies blocks that have any faulty bits at each pre-determined VDD level. Using the so-called fault inclusion property [37], we keep a very small fault map (1-2 bits per block) in the tag array, which is not voltage scaled. At any given runtime voltage, the fault map directly controls power gate transistors that disable unreliable blocks for further power savings. Meanwhile, the cache controller prohibits valid data from being placed in a faulty block. From the software's perspective, the cache capacity is reduced at low voltage, causing more misses, but otherwise the cache operates correctly with good power savings. However, the yield could be affected since each set requires at least one non-faulty block at all runtime voltages. Figure 2 illustrates the benefit of our power/capacity scaling approach compared with power gating and FFT-Cache [36] (one of our recent FTVS works), for trading off power and capacity. This is despite the inability of the proposed power/capacity scaling method to achieve the lowest voltage at any yield target (Figure 3), motivating further studies in this direction.
We proposed in [37] two policy variants of power/capacity scaling: static (SPCS) and dynamic (DPCS). SPCS allows the system software or cache controller to choose the optimal cache voltage at boot time, based on knowledge of faulty blocks gained through BIST, to achieve a minimum of 99% fault-free blocks.
While SPCS is simple and can greatly reduce the voltage guardband, it ignores the opportunity for even better energy savings through cross-layer hardware/software optimization. DPCS allows the system to adapt the cache VDD at runtime in response to varying workload behaviors. In [37] we had the cache controller adapt the voltage in response to changing miss rates and an estimate of the miss penalty. When many misses are encountered at low voltage, the controller raises VDD to make more blocks available for use, thereby reducing capacity and conflict misses. When few misses are encountered, the controller reduces VDD to opportunistically save power.
Higher-level semantics can mitigate the effect of the reduced cache size on performance, either simply (e.g., by increasing power, and thus hardware reliability, in phases of execution where the cache is fully utilized) or, more interestingly, by using the higher-level information to adapt the organization and utilization of the data so as to minimize misses given the faulty-cache configuration. More sophisticated cross-layer policies are part of our ongoing work. With knowledge of the power/capacity scaling mechanism and particular cache operating points, software could be optimized at compile-time or runtime to improve energy efficiency with minimal performance degradation.
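The static policy can be sketched as a boot-time search over the fault map. The sketch below assumes that the fault map stores, for each block, the index of the lowest VDD level at which that block is fault-free; the name `choose_static_vdd`, its parameters, and this encoding are illustrative assumptions rather than the interface used in [37]. The encoding relies on the assumption that a block that is faulty at some voltage is also faulty at every lower voltage, which is why a couple of bits per block suffice for a handful of levels.

```python
def choose_static_vdd(vdd_levels, min_ok_level, min_fault_free=0.99):
    """Pick the lowest VDD that keeps enough blocks fault-free.

    vdd_levels:     supply voltages in ascending order (hypothetical values).
    min_ok_level:   per-block index into vdd_levels of the lowest level at
                    which the block passed BIST (the assumed fault map).
    min_fault_free: required fraction of usable blocks (99% in the text).
    Returns (vdd, usable_block_count).
    """
    total = len(min_ok_level)
    usable = 0
    for level, vdd in enumerate(vdd_levels):
        # A block is usable at this level if it is reliable here or lower.
        usable = sum(1 for m in min_ok_level if m <= level)
        if usable / total >= min_fault_free:
            return vdd, usable
    # No level meets the target: fall back to the highest voltage.
    return vdd_levels[-1], usable
```

For example, with levels of 0.6 V, 0.7 V and 1.0 V and a map in which 90% of blocks are reliable at the lowest level and a further 9.5% at the middle level, the function selects 0.7 V. A fuller version would also check that every set retains at least one non-faulty block, which is the yield constraint mentioned above.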
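One way to picture the dynamic policy is as a small interval-based controller. The following is a hedged sketch, not the controller from [37]: the function name `next_vdd_level`, the cost estimate (miss rate times miss penalty), and both thresholds are assumptions chosen only for illustration.

```python
def next_vdd_level(level, num_levels, misses, accesses, miss_penalty_cycles,
                   high_cost=0.5, low_cost=0.1):
    """Return the VDD level index to use for the next sampling interval.

    level:               current index into an ascending list of VDD levels.
    misses, accesses:    counters gathered over the last interval.
    miss_penalty_cycles: estimated cost of one miss, in cycles.
    high_cost, low_cost: hypothetical thresholds on average stall cycles
                         per access that trigger a voltage change.
    """
    if accesses == 0:
        # Idle cache: nothing to lose by lowering the voltage one step.
        return max(level - 1, 0)
    stall_per_access = (misses / accesses) * miss_penalty_cycles
    if stall_per_access > high_cost and level < num_levels - 1:
        return level + 1   # raise VDD so that more blocks become usable
    if stall_per_access < low_cost and level > 0:
        return level - 1   # lower VDD to save power opportunistically
    return level           # stay put between the two thresholds
```

Using two thresholds rather than one gives the controller a dead band, so that a workload hovering around a single threshold does not cause the voltage to oscillate every interval.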

MEMORY AGING AND WEAROUT
We now review sample efforts that cope with wearout in memories and their limited lifetime at different levels of abstraction. As with resilient caches, higher-level semantic retention can help, by using information about how different program and algorithm-level structures are utilized (frequency of access, of reads or writes, their mappings at bank or cache level, etc. in different phases of program execution) both to increase efficiency of execution in the presence of faults, and to alleviate the expense of recovery mechanisms in software or hardware. We also illustrate how program characteristics can be exposed to the hardware in order to mitigate wearout effects, using the example of large GPGPU register files.

Wearout Mechanisms and Their Effects
Wearout mechanisms differ across memory families. While electron tunneling degrades the oxide layer in flash memory cells, SRAM is threatened by negative-bias temperature instability (NBTI), which weakens the drive current of PMOS devices. Furthermore, wearout effects are also different for each memory type. Wearout in NVMs limits the number of reliable writes. In SRAM, it decreases the stability of cells, especially for the read operation. Although wearout in NVMs is typically irreversible, SRAM wearout is partially recoverable.

Improving NVM Write Endurance
Traditional memory management techniques are write-variation oblivious and therefore cause parts of the memory to reach their maximum write count much earlier than the rest. Thus, most approaches for enhancing write endurance of NVMs are based on two ideas: (1) uniformly distributing writes over the whole memory space, and (2) reducing the number of write operations.

Flash as Main Memory
Approaches for wear-leveling in flash memories fall into two categories. First, dynamic wear leveling (DWL) techniques look at all of the available blocks that are free and select the one with the lowest erase count for the next write. However, they do not move cold data afterwards [38]. Second, static wear leveling (SWL) techniques try to prevent cold data from staying in any block for a long period of time. If the difference between two blocks' erase counts is too large, SWL starts erasing young blocks by moving cold data away from them [39].
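The two categories can be contrasted in a few lines. This is a minimal sketch under assumed data structures (a dictionary from block id to erase count, and a list of blocks known to hold cold data); the function names and the threshold rule are illustrative and do not reproduce the specific algorithms of [38] or [39].

```python
def pick_block_dwl(free_blocks, erase_count):
    """Dynamic wear leveling: write to the free block erased the fewest times."""
    return min(free_blocks, key=lambda block: erase_count[block])


def swl_victims(erase_count, cold_blocks, threshold):
    """Static wear leveling: find young blocks pinned down by cold data.

    Returns the cold-data blocks whose erase count trails the most-worn
    block by more than `threshold`. Their data would be migrated elsewhere
    so that these young blocks can rejoin the pool of write targets.
    """
    most_worn = max(erase_count.values())
    return [b for b in cold_blocks if most_worn - erase_count[b] > threshold]
```

DWL only ever chooses among blocks that are already free, so a block holding data that is never rewritten keeps its low erase count indefinitely; the SWL check is what brings such blocks back into circulation.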

PCM as Main Memory
Architectural Level Solutions: Flip-N-Write [40] is a microarchitectural technique that performs a read-before-write to decide whether to write the original data or its flipped version, depending on which causes fewer bit flips. This is transparent to the rest of the system, and the memory device takes care of inverting data whenever required. The authors in [41] consider manufacturing variation, which causes the programming current to be set based on the most difficult-to-reset cell. Instead of sacrificing the lifetime of other cells, they use a lower programming current through Fine-Grained Current Regulation, allowing difficult-to-reset cells to be recovered by error correcting pointers (ECP).
[42] targets hybrid PCM and DRAM memories. The operating system's page manager uses the page-level access frequency of PCM pages, tracked by hardware, in order to perform wear leveling. The OS also tries to swap hot pages from PCM to DRAM. By changing the memory controller, the TLBs, and the operating system, the authors of [43] dynamically form clean pages out of pages with faulty bits. This enables continued operation through graceful degradation when cells fail.
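The read-before-write decision in Flip-N-Write can be illustrated with a short sketch. It assumes one flip flag per word and counts a change of that flag as one additional bit flip; the word width, the function names, and the tie-breaking rule (prefer the non-inverted form) are assumptions made for illustration rather than details taken from [40].

```python
def flip_n_write(stored_word, stored_flag, new_word, width=32):
    """Decide how to store new_word over the current cell contents.

    stored_word: raw bits currently held in the cells.
    stored_flag: current flip flag (1 means the cells hold inverted data).
    Returns (word_to_store, flag_to_store, cells_changed).
    """
    mask = (1 << width) - 1
    new_word &= mask
    inverted = ~new_word & mask
    # Cost of each option: differing data cells plus a possible flag change.
    plain_cost = bin((stored_word ^ new_word) & mask).count("1") + stored_flag
    flipped_cost = bin((stored_word ^ inverted) & mask).count("1") + (1 - stored_flag)
    if flipped_cost < plain_cost:
        return inverted, 1, flipped_cost
    return new_word, 0, plain_cost


def read_word(stored_word, stored_flag, width=32):
    """Recover the logical value, undoing the inversion if the flag is set."""
    mask = (1 << width) - 1
    return (~stored_word & mask) if stored_flag else (stored_word & mask)
```

For example, overwriting an all-zero 8-bit word with 0xFF would change eight cells if written directly, whereas storing the inverted form changes only the flag bit. Because the two candidate encodings are complements of each other, their data-cell costs always sum to the word width, so the cheaper option never changes more than about half of the cells.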

Application Level Solutions:
A recent work by Sampson et al. [44] offers a new perspective for improving PCM lifetime. Through annotations, the application developer can identify some program variables as candidates for approximate storage. Hardware exploits this by reducing the number of programming pulses for the part of physical memory that holds this data. In addition, even failed cells are used for storing approximate data.

PCM as On-Chip SPM
HaVOC [66] uses a combination of programmer annotations and a data volatility metric to save energy while simultaneously increasing the lifetime of NVMs. The volatility metric measures the write frequency of a piece of data over its accumulated lifetime. Variable annotations are used to pass this metric to the run-time system, allowing the SPM manager to prioritize placing data with higher write frequency in on-chip SPM. Thus, by reducing the write operations to NVM, not only is the energy consumption of the SPM reduced, but its lifetime is also increased.

ReRAM as On-Chip Last-Level Cache
[45] proposes an inter/intra-set write variation-aware cache policy (i2WAP) for ReRAM caches. Using address remapping, it uniformly distributes cache writes between all of the cache sets. This solves the problem of inter-set variation, but within a set, hot cache lines are accessed more frequently because of locality. To solve this, i2WAP slightly modifies the Least Recently Used (LRU) replacement policy by intelligently writing back hot data at some timestamp and invalidating the corresponding line. The invalidated line would be a candidate for the next replacement, possibly for cold data.
[46] proposes Dynamic Indexing for SRAM caches. The authors observe that in a partitioned cache architecture, some of the partitions are idle during most of the application execution time, while others are accessed more often. They exploit this behavior by putting idle partitions in drowsy mode (i.e., reduced VDD). This slows down the wearout of SRAM cells in those partitions. The cache indexing function is also changed over time in order to uniformly distribute the idleness over all of the partitions.

Software Level Solution for SRAM SPMs
[47] presents a library of C functions for wearout-aware data allocation on physically-banked SPMs. For data allocation, SPM_malloc calls the SPM controller, which is aware of the current wearout status of each bank. This controller distributes allocation requests over the SPM banks in such a way that all banks could spend the same amount of time in drowsy mode.

Register File Aging Case Study: ARGO
Extreme multithreading with fast thread switching in GPGPUs is supported by large register files (RFs), which hold the execution state of each thread and are much larger than on-chip caches. To protect these register files against NBTI, ARGO [48] exposes program characteristics to the hardware in order to design a low-overhead stress distributor. In ARGO's flow (Figure 4), the OpenCL compiler embeds some metadata in the binary code, including the number of registers required by the kernel and its maximum amount of required memory. Based on this information, the host CPU decides at runtime how many threads to assign to each workgroup. Depending on the kernel requirements and resource limitations, not all of the available register file space can be used. On average, 46% underutilization is observed for the execution of 15 common general purpose kernels. In such a flow, the compiler helps the underlying hardware by letting it know how much of the register space is required by the software. The RF allocator then power-gates unused parts of the register file, thereby not only saving leakage power but, more importantly, ameliorating aging by putting those parts in NBTI recovery mode. Furthermore, the RF allocator employs a virtual sensing approach to estimate the aging profile of different RF banks in a relative manner. Based on that, and without any need for on-chip NBTI sensors, it circulates the allocated space through the entire physical space of the RF over time to enhance the RF lifetime.
DRAMs were observed to have over 20% read/write power variation [63], which was leveraged in [64] by dynamically adapting virtual to physical address mapping in the Linux operating system. The approach preferentially allocates frequently accessed data onto lower power memory DIMMs. SRAM arrays are known to have large variations, which limit their minimum operating voltage and hence power. [15] achieves reliability through redundancy by optimizing RAID-like policies tuned for on-chip distributed scratchpad memories at lower power cost than ECC with voltage overscaling. Extending this, [55] allows programmers to partition their application's address space (through annotations) into virtual address regions and create mapping policies for each region depending on their requirements (fault tolerance, power, etc.). In the cache context, FFT-Cache [36] uses sophisticated fault tolerance schemes in cache organization to achieve a lower operating voltage, while [37], described earlier, does this using simple fault tolerance mechanisms for lower overheads. Measurements show systematic variation in program latency within and across multi-level flash devices [65]. [62] extends conventional flash translation layers to schedule flash program operations on pages based on the operations' performance requirements and specific pages' performance characteristics.
Based on the observation that, for multi-level cell flash, whenever a cell error occurs, with high probability only one bit in the cell is in error, [61] proposed an error correcting code based on generalized tensor products.
The increasing fraction of memory real estate, coupled with emerging memory technologies with varying variability mechanisms, makes architecture and software-level handling of memory variations an integral part of the Variability Expedition.

SUMMARY AND CONCLUSIONS
In this paper, we highlighted efforts and opportunities for achieving memory resiliency both within and across multiple layers of the abstraction stack. To enable cross-layer memory resilience, it is important to understand the abstractions of memories, manifestations of memory errors and memory vulnerability at multiple levels. Our paper gave a sampling of these memory issues within the context of complex SoC designs. We also used two exemplars (resilient caches and memory aging) to illustrate multi-layer strategies for enhancing resilience.
While traditional memory resilience efforts have focused primarily on hardware, it is increasingly important to develop software-enabled mechanisms for managing memory resilience.
Moving forward, we should see efforts that synergistically combine hardware and software approaches to overcome the adverse effects of memory failures, and that opportunistically exploit application semantics to achieve more efficient designs, particularly in the context of applications that tolerate some level of quality degradation (e.g., approximate computing). System designers will need abstractions, tools, and methods to enable effective exploration of the memory resiliency design space.

Figure 5. The Underdesigned and Opportunistic Computing vision of the NSF Variability Expedition [49].