Architecture Support for Customizable Domain-Specific Computing
This dissertation investigates power-efficient, high-performance architecture support for customizable domain-specific computing at both the memory and communication levels of a customizable heterogeneous platform (CHP).
In domain-specific computing, memory access patterns can be obtained through offline analysis. With this knowledge, the cores and accelerators in the CHP can use on-chip scratchpad memory (SPM) and buffers to manage data replacement directly and thereby save off-chip memory bandwidth. We propose efficient schemes that combine the SPM with the primary caches, and the buffers with the shared last-level cache (LLC), into hybrid structures. In the hybrid primary cache, which has low associativity, balancing cache set utilization when the SPM is allocated in the cache is critical. We propose an adaptive hybrid cache (AH-Cache) that dynamically remaps SPM blocks from high-demand cache sets to low-demand cache sets. In the hybrid LLC (typically designed as a nonuniform cache architecture, NUCA), resource contention and fragmentation become the crucial problems. We propose a buffer-in-NUCA (BiN) scheme that assigns shared buffer space to the accelerators that can best utilize it, and that uses flexible paged buffer allocation to limit the impact of buffer fragmentation.
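To illustrate why paged allocation limits the impact of fragmentation, the following is a minimal sketch, not the BiN implementation. It assumes a pool of fixed-size pages and a per-accelerator page table; the class `PagedBufferPool`, its methods, and the page size are hypothetical names chosen for illustration. A request succeeds whenever enough free pages exist, even if they are not contiguous.

```python
class PagedBufferPool:
    """Toy model of a shared buffer pool split into fixed-size pages.

    Hypothetical illustration only; not the BiN hardware design.
    """

    def __init__(self, num_pages, page_size):
        self.page_size = page_size
        self.free_pages = list(range(num_pages))  # physical page ids
        self.page_tables = {}  # accelerator id -> list of physical pages

    def allocate(self, acc_id, num_bytes):
        """Give acc_id enough pages for num_bytes; pages need not be contiguous."""
        needed = -(-num_bytes // self.page_size)  # ceiling division
        if acc_id in self.page_tables or needed > len(self.free_pages):
            return False
        self.page_tables[acc_id] = [self.free_pages.pop(0) for _ in range(needed)]
        return True

    def release(self, acc_id):
        """Return all pages held by acc_id to the free list."""
        self.free_pages.extend(self.page_tables.pop(acc_id))
        self.free_pages.sort()

    def translate(self, acc_id, offset):
        """Map a buffer-relative byte offset to (physical page, offset in page)."""
        page_index, page_offset = divmod(offset, self.page_size)
        return self.page_tables[acc_id][page_index], page_offset


if __name__ == "__main__":
    pool = PagedBufferPool(num_pages=4, page_size=1024)
    for acc in ("A", "B", "C"):
        pool.allocate(acc, 1024)
    pool.release("A")
    pool.release("C")
    # Free pages 0 and 2 are separated by B's page 1, yet a 2-page request fits.
    print(pool.allocate("D", 2048), pool.page_tables["D"])
```

With contiguous allocation, the last request in this sketch would have to wait or fail, because the free pages are separated by a page still in use.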
In domain-specific computing, communication patterns can also be obtained through offline analysis. With this knowledge, the topology and routing scheme of the CHP communication subsystem can be customized to adapt dynamically to the known communication pattern. For topology customization, we propose application-specific shortcuts and multicast realized by radio-frequency interconnects (RF-I) overlaid on the network-on-chip (NoC). At runtime, RF-I bandwidth can be flexibly allocated to adapt the NoC topology to the known communication requirements of an application. For routing customization, we propose a power-efficient application-specific cycle elimination and splitting (ACES) routing scheme that avoids restricting the critical routes of an application while achieving deadlock freedom for irregular NoCs.
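The link between cycle elimination and deadlock freedom can be sketched with the classic channel dependency graph criterion: if the graph of dependencies between channels is acyclic, the routing cannot deadlock. The code below is a minimal illustration of that check only, not the ACES algorithm. It does not model how ACES chooses which dependencies to remove or how it splits channels. The functions `channel_dependencies` and `find_cycle` and the route format (lists of node ids) are assumptions made for this example.

```python
def channel_dependencies(routes):
    """Build a channel dependency graph from a set of routes.

    Each route is a list of node ids; a channel is a directed link (u, v).
    A dependency c1 -> c2 exists when some route uses c2 right after c1.
    """
    deps = {}
    for route in routes:
        channels = list(zip(route, route[1:]))
        for c in channels:
            deps.setdefault(c, set())
        for c1, c2 in zip(channels, channels[1:]):
            deps[c1].add(c2)
    return deps


def find_cycle(deps):
    """Return one dependency cycle as a list of channels, or None if acyclic."""
    WHITE, GRAY, BLACK = 0, 1, 2  # unvisited, on the DFS stack, finished
    color = {c: WHITE for c in deps}
    stack = []

    def visit(c):
        color[c] = GRAY
        stack.append(c)
        for nxt in deps[c]:
            if color[nxt] == GRAY:  # back edge closes a cycle
                return stack[stack.index(nxt):]
            if color[nxt] == WHITE:
                cycle = visit(nxt)
                if cycle:
                    return cycle
        stack.pop()
        color[c] = BLACK
        return None

    for c in deps:
        if color[c] == WHITE:
            cycle = visit(c)
            if cycle:
                return cycle
    return None


if __name__ == "__main__":
    # Four routes around a 4-node ring create a cyclic dependency.
    ring = [[0, 1, 2], [1, 2, 3], [2, 3, 0], [3, 0, 1]]
    print(find_cycle(channel_dependencies(ring)))
    # If the application never uses the last route, no cycle remains.
    print(find_cycle(channel_dependencies(ring[:3])))
```

The second call hints at why application knowledge helps: dependencies that the application never exercises need not be restricted, so only the cycles that actually arise from its routes have to be broken.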
To further demonstrate the feasibility and effectiveness of these techniques, we develop an FPGA prototype of the proposed CHP with shared accelerators and buffers. Buffer sharing is achieved through a cost-efficient partial crossbar that reduces the timing and area overhead of sharing.