The growing computing demands of emerging application domains such as Recognition/Mining/Synthesis (RMS), visual computing, wearable devices, and the Internet of Things (IoT) have driven the move towards manycore architectures, which better manage the tradeoffs among performance, energy efficiency, and reliability.
The memory hierarchy of a manycore architecture has a major impact on its overall performance, energy efficiency, and reliability. We identify three major problems that make traditional memory hierarchies unattractive for manycore architectures and their data-intensive workloads: (1) they are power hungry and therefore a poor fit for manycores in the face of dark silicon, (2) they cannot adapt to a workload's requirements and memory behavior, and (3) they do not scale due to coherence overheads.
This thesis argues that many of these inefficiencies are the result of software-agnostic hardware-managed memory hierarchies. Application semantics and behavior captured in software can be exploited to more efficiently manage the memory hierarchy. This thesis exploits some of this information and proposes a number of techniques to mitigate the aforementioned inefficiencies in two broad contexts: (1) explicit management of hybrid cache-SPM memory hierarchies, and (2) exploiting approximate computing for energy efficiency.
We first present the hardware and software support required for a software-assisted memory hierarchy composed of distributed memories that can be partitioned between caches and software-programmable memories (SPMs) at runtime. This memory hierarchy supports local and remote allocations as well as data movement between SPM and cache and between two physical SPMs. The distributed SPM space is shared among a mix of threads, each of which explicitly requests SPM space throughout its execution, and the runtime component distributes the entire SPM space among contending threads according to an allocation policy. Unlike traditional memory hierarchies, this hierarchy incorporates no coherence logic: the program explicitly allocates shared data in the distributed SPM space, and accesses to that data from all threads of the program are forwarded to the same physical copy.
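To make the explicit management concrete, the following is a minimal C++-style sketch of how a thread might use such an allocation interface. The names spm_alloc, spm_move, and spm_free, the home-tile argument, and the fallback path are hypothetical illustrations under the assumptions above, not the exact API defined in this thesis.

    #include <cstddef>
    #include <cstdint>

    // Hypothetical runtime interface for the distributed SPM space (illustrative only).
    // spm_alloc requests 'bytes' of SPM space, preferably on 'home_tile'; the runtime
    // may satisfy the request locally or remotely according to its allocation policy.
    void* spm_alloc(std::size_t bytes, int home_tile);
    void  spm_free(void* spm_ptr);
    // Explicit data movement between cacheable memory and SPM space,
    // or between two physical SPMs.
    void  spm_move(void* dst, const void* src, std::size_t bytes);

    void scale_tile(const float* input, float* output, std::size_t n, int my_tile) {
        // Request SPM space for this thread's working set.
        float* buf = static_cast<float*>(spm_alloc(n * sizeof(float), my_tile));
        if (buf == nullptr) {
            // Allocation denied by the runtime policy: fall back to the cache path.
            for (std::size_t i = 0; i < n; ++i) output[i] = 0.5f * input[i];
            return;
        }
        spm_move(buf, input, n * sizeof(float));   // stage data into the SPM
        for (std::size_t i = 0; i < n; ++i)        // compute out of the SPM
            buf[i] *= 0.5f;
        spm_move(output, buf, n * sizeof(float));  // write results back
        spm_free(buf);                             // return the space to the runtime
    }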
Next, we augment the caches and SPMs in this hierarchy with approximation support in order to improve the energy efficiency of the memory subsystem when running approximate programs, presenting approximation techniques for the major building blocks of our hybrid cache-SPM memory hierarchy. We introduce Relaxed Cache, an approximate private L1 SRAM cache whose quality, capacity, and energy consumption are controlled through two architectural knobs: the supply voltage and the number of acceptable faulty bits per cache block. We then present QuARK Cache, an approximate shared L2 STT-MRAM cache in which the read and write current amplitudes provide two knobs for trading the accuracy of memory operations against dynamic energy consumption. We then introduce Write-Skip, a technique that skips write operations in STT-MRAM data SPMs when the value already stored and the new value are approximately equal. Finally, we discuss a quality-configurable memory approximation strategy, based on formal control theory, that adjusts the level of approximation at runtime to meet the desired quality of the program's output.
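As an illustration of the Write-Skip idea, the sketch below shows the kind of comparison a write port could perform before committing a write to an STT-MRAM SPM. The word granularity, the magnitude-based threshold, and the function names are assumptions made for illustration, not the exact mechanism evaluated in the thesis.

    #include <cstdint>
    #include <cstdlib>

    // Hypothetical Write-Skip check (illustrative): before writing a word into an
    // STT-MRAM SPM, compare it with the value already stored and skip the costly
    // write if the two are approximately equal.
    bool approximately_equal(std::int32_t stored, std::int32_t incoming,
                             std::int32_t threshold) {
        return std::abs(stored - incoming) <= threshold;
    }

    // Returns true if the write was performed, false if it was skipped.
    bool write_skip_store(std::int32_t* spm_word, std::int32_t incoming,
                          std::int32_t threshold) {
        if (approximately_equal(*spm_word, incoming, threshold)) {
            return false;        // skip: saves STT-MRAM write energy at a small error
        }
        *spm_word = incoming;    // perform the exact write
        return true;
    }

In this sketch, a larger threshold skips more writes and saves more energy at the cost of output quality, which is the kind of knob the runtime quality controller would adjust.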
We implemented all software and hardware components of the proposed software-assisted memory hierarchy in the gem5 architectural simulator. Our simulations on a mix of RMS applications and microbenchmarks show that the proposed techniques achieve better performance, energy efficiency, and scalability for manycore systems than traditional hardware-managed memory hierarchies.