Architectural Techniques to Enhance the Efficiency of Accelerator-Centric Architectures
- Author(s): Hao, Yuchen
- Advisor(s): Reinman, Glenn D
- et al.
In light of the failure of Dennard scaling and the recent slowdown of Moore's Law, both industry and academia seek drastic measures to sustain the scalability of computing and meet ever-growing demands. Customized hardware accelerators, in the form of specialized datapaths and memory management, have gained popularity for their promise of orders-of-magnitude performance and energy gains compared to general-purpose cores. The computer architecture community has proposed many heterogeneous systems that integrate a rich set of customized accelerators onto the same die. While such architectures promise tremendous performance-per-watt gains, our ability to reap the benefit of hardware acceleration is limited by the efficiency of the integration.
This dissertation presents a series of architectural techniques to enhance the efficiency of accelerator-centric architectures. Starting with physical integration, we propose the Hybrid network with Predictive Reservation (HPR) to reduce data movement overhead on the on-chip interconnection network. The proposed hybrid-switching approach prioritizes accelerator traffic using circuit switching while minimizing the interference caused to regular traffic. Moreover, to enhance the logical integration of customized accelerators, this dissertation presents efficient address translation support for accelerator-centric architectures. We observe that accelerators exhibit a page split phenomenon due to data tiling, as well as immense sensitivity to address translation latency. We use these observations to design two-level TLBs and host page walks that reduce TLB misses and page walk latency, achieving performance within 6.4% of ideal. Finally, on-chip accelerators are only one part of the entire system. To eliminate data movement across chip boundaries, we present the compute hierarchy, which integrates accelerators into each level of the conventional memory hierarchy, offering distinct compute and memory capabilities. We propose a global accelerator manager to coordinate accelerators at different levels and demonstrate its effectiveness by deploying a content-based image retrieval system.
The techniques described in this dissertation demonstrate some initial steps towards efficient accelerator-centric architectures. We hope that this work, and other research in the area, will address many issues of integrating customized accelerators, unlocking end-to-end system performance and energy efficiency and opening up new opportunities for efficient architecture design.