The cost of transferring data between the off-chip memory system and the compute unit is the fundamental energy and performance bottleneck in conventional multi-core computing systems. In the era of big data, emerging data-intensive applications such as graph processing, machine learning, deep learning, media processing, data mining, computer vision, computational biology, and speech recognition have steadily worsened this bottleneck. For such applications, expensive data movement between memory and the compute unit dominates both execution time and energy/power consumption, which impedes future performance scaling. Moreover, the slowdown of technology scaling (the end of Moore's law and the breakdown of Dennard scaling) has left all compute units energy and power constrained. To satisfy these constraints, researchers have been forced to stop increasing clock frequency and to reduce chip utilization. Thus, to continue scaling performance, the energy overhead of every operation must be minimized. These difficulties can be addressed at either the algorithmic level or the architectural level. The latter approach, commonly referred to as Near Memory Processing (NMP), has become a promising and practical technology for transforming computation-centric systems into memory-centric systems. The introduction of 3D die stacking technology and, more importantly, hybrid memory systems has revolutionized the concept of NMP. 3D die stacking, built using Through-Silicon Vias (TSVs), offers higher bandwidth, shorter wire lengths, lower power (due to short, low-capacitance wires), and better performance than traditional 2D planar memories. This memory technology allows architects to implement practical NMP systems by vertically stacking multiple memory layers on top of a logic die in the same package.
The logic layer is typically the bottommost layer and provides area for a wide range of processing logic (general-purpose cores, FPGAs, ASICs, or a combination of these). It enables higher-density many-core architectures and improves the power-performance characteristics of modern integrated circuits, increasing their capabilities.
The focus of this dissertation is to explore and evaluate the feasibility and efficacy of an NMP architecture built on an emerging Non-Volatile Memory (NVM) technology in a 3D structure, and to compare it with the conventional NMP architecture built on 3D-DRAM in terms of performance and power consumption. To this end, a set of NMP-centric performance metrics is first redefined in order to analyze the efficacy of mapping a given processing unit to a specific application. Leveraging the proposed metrics, a comprehensive characterization is conducted, as a case study, on a wide range of multi-threaded applications from different domains with various computation and memory patterns, in order to reveal their performance bottlenecks. Then, two different NMP architectures are explored, and the impact of constructing an NMP architecture based on an emerging non-volatile memory technology (3D-NVM) is analyzed.
This dissertation also motivates the feasibility of an NMP subsystem built on a hybrid 3D memory system.
Finally, the experimental results demonstrate that executing certain data-intensive (memory-intensive) applications on the evaluated NMP architectures (3D-PCM and 3D-DRAM) improves performance by 1.3x to 5x and reduces memory power/energy consumption by an average of 47% compared to executing them on a conventional multi-core host CPU system. These improvements make the hybrid NMP system a promising design technique for improving performance and power consumption across a wide range of data-intensive applications.