In many domains, accelerators such as graphics processing units (GPUs) and field-programmable gate arrays (FPGAs) provide significantly higher performance than general-purpose processors at much lower power. Accelerator-rich architectures are thus much more energy-efficient and are becoming mainstream.
This dissertation investigates two important keys to the performance and power efficiency of accelerator-rich architectures: resource management and data management. Three broad classes of accelerator-rich architectures are considered: chip-level architectures such as systems-on-chips (SoCs), node-level architectures, and cluster-level architectures.
We first study SoC resource management for a broad class of streaming applications. On accelerator-rich SoCs, where multiple computation kernels space-share a single chip, we explore the tradeoffs between on-chip resources and system performance, and find the best combination of accelerator implementations and data communication channel implementations to realize the application functionality.
We continue with node-level accelerator-rich architectures, where we orchestrate two kinds of computation resources, the CPU and the accelerator, on a PCIe-integrated CPU-accelerator platform, and explore a CPU-FPGA collaboration approach to improve application performance.
Then we study the resource allocation problem on accelerator-rich clusters, where accelerators are time-shared among multiple tenants. Departing from traditional cluster resource management, we propose treating accelerators as first-class citizens in the cluster resource pool, and develop an accelerator-centric resource scheduling policy that enables fine-grained accelerator sharing among multiple tenants.
Finally, we investigate data shuffling on accelerator-rich clusters and evaluate the feasibility of using accelerators during data shuffling. We find that although data shuffling involves a large amount of computation, using accelerators does not necessarily improve system performance, because accelerators introduce data serialization and deserialization overhead.