Automatic Synthesis and Architecture Optimization of Systolic Arrays
A systolic array consists of a grid of simple processing elements (PEs) connected through local interconnects. With a massive number of PEs and purely local communication, such an architecture can achieve high performance and energy efficiency. This dissertation extends research on systolic array architectures in two directions: automatic systolic array synthesis and architecture optimization. The first part of the dissertation focuses on automated systolic array synthesis. Designing a high-performance systolic array requires an understanding of both the application characteristics and the hardware architecture, so reaping its benefits takes non-trivial effort. A large body of prior work has developed compilation frameworks for systolic arrays. However, these works fail to balance generality, performance, and productivity, making them hard to use in practice. Our work advances this field by leveraging two compilation technologies: the polyhedral model and high-level synthesis (HLS). We propose a new compilation framework, AutoSA, which is built on the polyhedral framework and generates high-performance systolic arrays for FPGAs in HLS languages. We show that AutoSA handles applications with complex dependence structures and generates designs with performance comparable to or better than manual designs. AutoSA incorporates a broad set of hardware optimization techniques that open up a vast design space which is intractable to explore manually. To cope with this challenge, we propose an efficient auto-tuning framework, Odyssey, which finds optimal designs within seconds. Together, AutoSA and Odyssey reduce the development cycle of systolic arrays from weeks to days, significantly boosting productivity over prior work.
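To make the dataflow concrete, the sketch below simulates a small output-stationary systolic array for matrix multiplication: each PE holds one output accumulator, and operands travel only between neighboring PEs (A to the right, B downward), with the skewed boundary feeding that such arrays require. This is an illustrative software model of the architecture described above, not code generated by AutoSA; the function name and register layout are our own.

```python
import numpy as np

def systolic_matmul(A, B):
    """Simulate an output-stationary systolic array computing C = A @ B.

    PE (i, j) owns accumulator C[i, j]. A streams in from the left edge,
    B from the top edge, and each PE communicates only with its right and
    down neighbors -- the local interconnect that defines a systolic array.
    """
    n, k = A.shape
    k2, m = B.shape
    assert k == k2
    C = np.zeros((n, m))
    a_reg = np.zeros((n, m))  # A operand currently held by each PE
    b_reg = np.zeros((n, m))  # B operand currently held by each PE
    # Row i of A is delayed by i cycles, column j of B by j cycles,
    # so matching operands meet at PE (i, j) on the same cycle.
    for t in range(k + n + m - 2):
        # Local shifts: A moves right, B moves down (back-to-front).
        for i in range(n):
            for j in range(m - 1, 0, -1):
                a_reg[i, j] = a_reg[i, j - 1]
        for j in range(m):
            for i in range(n - 1, 0, -1):
                b_reg[i, j] = b_reg[i - 1, j]
        # Inject new operands at the array boundary (zero-padded skew).
        for i in range(n):
            s = t - i
            a_reg[i, 0] = A[i, s] if 0 <= s < k else 0.0
        for j in range(m):
            s = t - j
            b_reg[0, j] = B[s, j] if 0 <= s < k else 0.0
        # Every PE performs one multiply-accumulate per cycle.
        C += a_reg * b_reg
    return C
```

Note that the total cycle count, k + n + m - 2, reflects the pipeline fill and drain caused by the skewed feeding; the hardware analogue is that throughput, not single-result latency, is what the array optimizes.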
In the second part of the dissertation, we present two application optimization studies that deploy systolic arrays across applications and platforms. The first study investigates the architectural trade-offs of using systolic arrays for one important application, convolutional neural networks (CNNs). The results show that a single monolithic systolic array cannot handle the divergent characteristics of different CNN layers. We therefore explore a multi-array architecture that implements several smaller systolic arrays, each configured for a particular CNN layer. Multi-array designs improve throughput at the cost of longer latency. This work reveals the complexities and trade-offs of mapping a real-world application to systolic arrays. Beyond FPGAs, systolic arrays can also be mapped to GPUs as an overlay on the existing GPU architecture. The second study investigates the performance trade-offs of mapping systolic arrays to GPUs. By leveraging the shuffle instructions on Nvidia GPUs to implement inter-PE communication, we achieve a speedup over baselines that use shared memory. The systolic array plays an important role in the post-Moore's-law era as an architecture capable of delivering high performance and energy efficiency. The works presented in this dissertation provide comprehensive and efficient solutions for lowering the programming effort and optimizing the performance of this architecture. We hope the promising results of these works will open the door to more deployments of systolic arrays across a broader range of applications and hardware platforms.
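As a minimal illustration of shuffle-based communication, the sketch below models the semantics of CUDA's `__shfl_up_sync` intrinsic (a lane reading another lane's register directly, with no shared-memory round trip) and uses it to move streamed operands lane-to-lane in a one-dimensional systolic row. The helper names and the particular mapping are our own simplification for exposition, not the dissertation's actual GPU kernel.

```python
def shfl_up(regs, delta=1):
    """Model of CUDA __shfl_up_sync: lane i receives lane (i - delta)'s
    register; lanes with no source keep their own value."""
    return [regs[i - delta] if i >= delta else regs[i]
            for i in range(len(regs))]

def warp_row_times_matrix(a, B):
    """Compute a @ B on a 'warp' of len(B[0]) lanes, one output per lane.

    Each lane acts as a PE: streamed a-operands travel lane-to-lane via
    shfl_up (register-to-register), instead of being staged through
    shared memory, while each lane reads its own column of B.
    """
    k, m = len(a), len(B[0])
    acc = [0.0] * m    # per-lane output accumulator (a register)
    a_reg = [0.0] * m  # per-lane copy of the travelling a operand
    for t in range(k + m - 1):
        a_reg = shfl_up(a_reg)             # inter-PE communication
        a_reg[0] = a[t] if t < k else 0.0  # lane 0 reads the next operand
        for j in range(m):                 # each lane does one MAC
            s = t - j                      # which a element lane j holds now
            if 0 <= s < k:
                acc[j] += a_reg[j] * B[s][j]
    return acc
```

On real hardware the per-lane loop runs in parallel across the warp; the point of the sketch is that the only data movement between PEs is the shuffle, which is exactly what replaces the shared-memory loads and stores in the baseline.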