The exponential growth in dataset size demanded by modern big data applications requires innovative server architecture design for data centers. Server workloads comprise a diverse set of data-intensive and compute-intensive applications. These applications often interact with in-memory or in-storage datasets, and their throughput is crucial to the user experience. However, today's commodity hardware solutions that serve such applications often follow a compute-centric approach, which lacks adequate interconnect bandwidth for many of these use cases and leads to substantial data transfer latency and energy consumption. One promising solution is to offload part of the computation closer to the data medium. Various near-data processing (NDP) techniques, while targeting data at different levels of the memory hierarchy, share common benefits: higher aggregate bandwidth, concurrent access, and a lower energy cost of moving data.
While prior work has shown many benefits of NDP accelerators, it has not solved three main challenges: 1) Existing NDP accelerators are mostly early-stage prototypes with limited capability for system configuration. Thus, researchers lack a practical way to explore the design space and show the benefits of NDP techniques under different system parameters and applications. 2) Focusing solely on one level of the memory hierarchy (cache, main memory, or storage) cannot provide a satisfactory solution for data center servers, which serve a diverse range of applications. A good understanding of application characteristics and of their suitability for NDP acceleration is missing. 3) Compute and memory requirements commonly vary, even across different execution phases of a single application. Thus, a single application could benefit from the collaboration of different NDP accelerators, but the collective benefits have not been explored.
In this dissertation, we address these three challenges by presenting a multi-level acceleration platform that combines on-chip, near-memory, and near-storage accelerators, spanning all levels of the conventional memory hierarchy. Our simulation platform features compute levels with adjustable memory and accelerator parameters, thus offering a broad spectrum of acceleration options. To enable effective acceleration of various application pipelines, we propose a holistic approach to coordinating the compute levels, which reduces inter-level data access interference and provides asynchronous task flow control. To minimize the programming effort of using the platform, we design a uniform programming interface that decouples the configuration from the user application source code and allows runtime adjustments without modifying the deployed application. We use our simulation platform to quantify the collective performance and energy benefits of NDP at all levels of the memory hierarchy. We also present an in-depth study of workload characteristics across various classes of applications, including visual retrieval, database, finance, and security, to demonstrate that a proper application mapping can avoid unnecessary data movement and achieve significant improvements in performance and energy efficiency.