Specialized accelerators are increasingly attractive solutions for continuing expected generational performance scaling as technology scaling slows. Existing programmable accelerators like GPUs are limited to regular algorithms; however, supporting irregularity becomes necessary to accelerate more advanced algorithms or those from challenging domains. Irregularity occurs when aspects of the execution depend on the data -- these aspects can be control, memory, parallelism, and reuse. An example of data-dependent execution is joining two sparse lists, where each branch outcome depends on the data; this inherently couples computation with memory and precludes efficient vectorization -- defeating the traditional mechanisms of programmable accelerators (e.g., GPUs). Similarly, data dependencies such as producer-consumer streaming patterns incur detrimental synchronization overhead in shared-memory systems.
We aim to develop a family of compatible dataflow accelerator mechanisms. Our key insight is that supporting arbitrary forms of irregularity is unnecessary: a set of specializable data-dependence forms, though more restrictive, is sufficient for many workloads. We expose these data-dependence forms as first-class citizens in the instruction set architecture (ISA), so that the corresponding hardware can be specialized for efficiency while remaining flexible to the requirements of applications.
This dissertation studies applications from machine learning, graph processing, databases, and signal processing. First, we identified specializable forms for instruction- and task-level irregularity (where a task is a coarse-grained unit of work). These primitives are designed to be composable and extensible, enabling continual evolution. Then, we used these forms to design a family of spatial architectures that support irregularity with compatible features. Leveraging this hardware flexibility, we also studied different algorithmic implementations and developed insights into the complex effects of input characteristics, workload, and algorithmic choice on performance.
We evaluate the architectures by integrating a custom accelerator simulator into the cycle-level gem5 simulator. The input programs are written in a combination of C and intrinsics for traditional dataflow and for our data-dependence forms. The hardware components are implemented in Chisel and synthesized in an industrial 28-nm technology. Overall, our architectures achieve order-of-magnitude speedups over a similarly provisioned GPU while remaining within 30% of application-specific architectures. In graph processing, we achieved a 5.7x speedup over the best prior domain-specific accelerator by using the flexibility to choose the optimal algorithm for the given input workload and data characteristics. Our hardware can thus be a strong alternative to GPUs, the dominant compute-offload accelerators today. Moreover, our proposed features target irregular primitives yet perform similarly to prior domain-specific architectures. We believe this work is a first step toward unifying the rapidly evolving irregular-acceleration space.