Graph Convolutional Networks (GCNs) have shown great results but come with large computation costs and memory overhead. Recently, sampling-based approaches have been proposed to adjust input sizes, which allows large GCN workloads to fit within hardware constraints. Motivated by this flexibility, this thesis proposes an FPGA-based GCN accelerator, along with a novel sparse matrix format and multiple software-hardware co-optimizations to improve training efficiency. First, all feature and adjacency matrices of the GCN are quantized from 32-bit floating point to 16-bit signed integers. Next, the non-linear operations are simplified to better fit FPGA computation, and reusable intermediate results are identified and stored to eliminate redundant computation. Moreover, a linear-time sparse matrix compression algorithm is employed to further reduce memory bandwidth requirements while allowing efficient decompression in hardware. Finally, a unified hardware architecture is proposed to process sparse-dense matrix multiplication (SpMM), dense matrix multiplication (MM), and transposed matrix multiplication (TMM) on the same group of processing elements (PEs), maximizing DSP utilization on the FPGA.
Evaluation is performed on a Xilinx Alveo U200 board. Compared with an existing FPGA-based accelerator on the same network architecture, the new accelerator achieves up to 11.3× speedup while maintaining the same training accuracy. It also achieves up to 178× and 13.1× speedup over state-of-the-art CPU and GPU implementations, respectively, on popular datasets.
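
As an illustration of the first optimization, the conversion of 32-bit floating-point values to 16-bit signed integers can be sketched as below. The abstract does not state which quantization scheme is used, so this sketch assumes symmetric linear quantization with a single scale factor per matrix; the function names `quantize_int16` and `dequantize` are hypothetical and are not taken from the thesis.

```python
INT16_MAX = 32767
INT16_MIN = -32768


def quantize_int16(values, scale=None):
    """Map floats to signed 16-bit integers with one shared scale.

    Assumption: symmetric linear quantization, where the scale is chosen
    so that the largest magnitude in `values` maps to INT16_MAX.
    """
    if scale is None:
        max_abs = max((abs(v) for v in values), default=0.0)
        scale = max_abs / INT16_MAX if max_abs > 0 else 1.0
    # Round to the nearest integer and clamp to the int16 range.
    quantized = [
        max(INT16_MIN, min(INT16_MAX, round(v / scale))) for v in values
    ]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float values from the quantized integers."""
    return [q * scale for q in quantized]


if __name__ == "__main__":
    features = [0.75, -0.5, 0.0, 0.001]
    q, s = quantize_int16(features)
    print(q, s)
    print(dequantize(q, s))
```

Under this assumption, each recovered value differs from the original by roughly half the scale factor at most, which is why a 16-bit representation can often preserve accuracy while halving the memory traffic relative to 32-bit floats.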