FPGA Overlay Processor for Deep Neural Networks
- Yu, Yunxuan
- Advisor(s): He, Lei
Abstract
The rapid advancement of artificial intelligence (AI) is making everyday life easier through smart assistants, automatic medical analyzers, bank plagiarism checkers, traffic prediction, and more. Deep learning algorithms, especially deep convolutional neural networks (DCNNs), achieve top performance on AI tasks but suffer from dense computational requirements, which call for hardware acceleration. In this thesis, we propose several accelerator architectures, together with a compilation flow, for general DCNN acceleration on FPGA platforms.
Starting in late 2015, we designed customized accelerators for popular DCNNs such as VGG and YOLOv2. We reformulate convolution by flattening it into a large-scale matrix multiplication between feature maps and convolution kernels, which can be computed as inner products. With this formulation, the accelerators across all layers can be unified to enhance resource sharing and maximize utilization of computing resources. We also quantized the networks to 8-bit with negligible accuracy loss to reduce memory footprint and computation resources. Different parallelism optimization strategies were explored for different networks. The VGG16 accelerator achieved 1.15x the throughput of state-of-the-art designs while running at a 1.5x lower frequency. The YOLOv2 accelerator was commercialized and deployed for real-time automatic hazard detection in subway X-ray screening.
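For illustration, the following minimal sketch (not the thesis RTL; the variable names are ours) shows the im2col-style flattening described above, where a convolution layer becomes one large matrix multiplication between unrolled input patches and flattened kernels, with each output pixel computed as an inner product.

```python
# Minimal sketch of the im2col-style flattening described above (illustrative
# only): convolution becomes a single matrix product between unrolled input
# patches and flattened kernels.
import numpy as np

def im2col(x, k, stride=1):
    """Unroll (C, H, W) feature maps into a patch matrix of shape
    (out_h * out_w, C * k * k)."""
    C, H, W = x.shape
    out_h = (H - k) // stride + 1
    out_w = (W - k) // stride + 1
    cols = np.empty((out_h * out_w, C * k * k), dtype=x.dtype)
    idx = 0
    for i in range(0, H - k + 1, stride):
        for j in range(0, W - k + 1, stride):
            cols[idx] = x[:, i:i + k, j:j + k].ravel()
            idx += 1
    return cols, out_h, out_w

# Kernels of shape (M, C, k, k) flatten to (C * k * k, M); each output pixel
# is an inner product between one patch row and one kernel column.
x = np.random.rand(3, 8, 8).astype(np.float32)
w = np.random.rand(16, 3, 3, 3).astype(np.float32)
cols, oh, ow = im2col(x, k=3)
y = cols @ w.reshape(16, -1).T          # (oh * ow, M) matrix multiplication
y = y.T.reshape(16, oh, ow)             # back to (M, out_h, out_w) maps
```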
Building on the experience gained from these customized designs, we developed an RTL compiler as an end-to-end solution that automatically generates an RTL design for a given CNN and FPGA platform, greatly reducing the human effort needed to develop a network-specific accelerator. The compiler applies analytical performance models to optimize module parameters drawn from a hand-written template library so that overall throughput is maximized. Several levels of convolution parallelism are explored, including inter-feature-map, intra-kernel-set, and input/output-channel parallelism. We also optimize the block RAM and input buffer architectures to speed up data flow. We tested the compiler on several well-known CNNs, including AlexNet and VGGNet, on different FPGA platforms. The resulting AlexNet accelerator achieves 113.69 GOPS on the Xilinx VCU095 and 177.44 GOPS on the VC707, and the VGGNet accelerator achieves 226 GOPS on the VCU095 at 100 MHz. These are 1.3x, 2.1x, and 1.2x better than the best FPGA accelerators reported at the time, respectively.
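As a rough illustration of the analytical-model idea (the function and parameter names below are assumptions, not the compiler's actual interface), one can enumerate candidate parallelism factors and keep the configuration that maximizes estimated throughput within the device's DSP budget.

```python
# Hedged sketch of an analytical performance model for choosing parallelism
# factors (names are illustrative, not the compiler's real interface).
import math

def conv_macs(layer):
    """Multiply-accumulate count of one CONV layer."""
    return (layer["out_h"] * layer["out_w"] * layer["k"] ** 2
            * layer["c_in"] * layer["c_out"])

def estimate_cycles(layer, p_in, p_out):
    """Compute-bound cycle estimate when p_in input channels and p_out output
    channels are processed in parallel every cycle."""
    return math.ceil(conv_macs(layer) / (p_in * p_out))

def search(layers, dsp_budget, freq_hz):
    """Pick the (p_in, p_out) pair that maximizes estimated GOPS."""
    best = None
    for p_in in (1, 2, 4, 8, 16, 32, 64):
        for p_out in (1, 2, 4, 8, 16, 32, 64):
            if p_in * p_out > dsp_budget:            # assume one MAC per DSP
                continue
            cycles = sum(estimate_cycles(l, p_in, p_out) for l in layers)
            gops = 2 * sum(conv_macs(l) for l in layers) \
                   / (cycles / freq_hz) / 1e9
            if best is None or gops > best[0]:
                best = (gops, p_in, p_out)
    return best

# Toy example: two layers of a small network on a 2048-DSP device at 100 MHz.
layers = [dict(out_h=56, out_w=56, k=3, c_in=64, c_out=64),
          dict(out_h=28, out_w=28, k=3, c_in=128, c_out=128)]
print(search(layers, dsp_budget=2048, freq_hz=100e6))
```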
However, a network-specific accelerator requires regeneration of the logic and physical implementation whenever the network is updated. Moreover, it cannot handle cascaded-network applications, which are widely employed in complex real-world scenarios. Therefore, we propose a domain-specific FPGA overlay processor, named OPU, that accelerates a wide range of CNNs without reconfiguring the FPGA when networks are switched or updated. We define a domain-specific instruction set architecture (ISA) with optimized granularity to maintain high efficiency while gaining extra programmability. We also built hardware micro-architectures on FPGA to verify ISA efficiency, along with a compiler flow for parsing, optimization, and instruction generation. Experiments show that OPU achieves an average of 91% run-time MAC efficiency (RME) across various popular networks. Moreover, for VGG and YOLO networks, OPU outperforms automatically compiled network-specific accelerators in the literature. In addition, OPU shows 5.35x better power efficiency than a Titan Xp GPU. For a case using cascaded CNN networks, OPU is 2.9x faster than the edge-computing GPU Jetson TX2 with a similar amount of computing resources. Our OPU platform was deployed in a real-world automatic curbside parking charging system.
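For reference, run-time MAC efficiency can be read as the fraction of available MAC slots that perform useful work during a run; the small sketch below (names are illustrative, not the thesis code) shows this computation.

```python
# Small sketch of how run-time MAC efficiency (RME) is typically computed:
# the fraction of available MAC slots doing useful work over a run.
def run_time_mac_efficiency(total_macs, num_mac_units, cycles):
    """RME = useful MAC operations / (MAC units * cycles elapsed)."""
    return total_macs / (num_mac_units * cycles)

# Example: a network needing 1.4e9 MACs finishes in 1.54e6 cycles on a
# 1024-MAC array -> RME ~= 0.89 (89% of peak MAC throughput was used).
print(run_time_mac_efficiency(1.4e9, 1024, 1.54e6))
```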
Using OPU as the base design, we extend it into variants that handle newly emerged DCNN architectures. Light-OPU targets light-weight (LW) DCNN acceleration, with the OPU architecture modified to fit memory-bound light-weight operations. Its instruction architecture lets LW operations and conventional convolution operations share the major computation engine, which improves run-time resource efficiency and overall power efficiency. Our experiments on seven major LW-CNNs show that Light-OPU achieves 5.5x better latency and 3.0x higher power efficiency on average compared with the edge GPU NVIDIA Jetson TX2. We also developed Uni-OPU for efficient, uniform hardware acceleration of transposed convolutional (TCONV) networks as well as conventional convolutional (CONV) networks. An extra compiler stage transforms the computation of zero-insertion-based TCONV (Zero-TCONV), nearest-neighbor-resizing-based TCONV (NN-TCONV), and CONV layers into the same pattern. The compiler conducts the following optimizations: (1) eliminating up to 98.4% of the operations in TCONV by exploiting the fixed pattern of TCONV upsampling, as illustrated in the sketch below; (2) decomposing and reformulating TCONV and CONV processes into streaming, parallelized vector multiplications with a uniform address-generation scheme and data-flow pattern. Uni-OPU reaches a throughput of up to 2.35 TOPS on TCONV layers. We evaluate Uni-OPU on a benchmark set of six TCONV networks from different application fields. Extensive experimental results indicate that Uni-OPU achieves 1.45x to 3.68x higher power efficiency than state-of-the-art Zero-TCONV accelerators. High acceleration performance is also achieved on NN-TCONV networks, whose acceleration had not been explored before. On average, we observe 15.04x and 12.43x higher power efficiency on Zero-TCONV and NN-TCONV networks, respectively, compared with a Titan Xp GPU. To the best of our knowledge, this is the first in-depth study to completely unify the computation of Zero-TCONV, NN-TCONV, and CONV layers.
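The following hedged sketch (not the Uni-OPU compiler itself) illustrates the Zero-TCONV observation behind optimization (1): a stride-2 transposed convolution is equivalent to zero-inserted upsampling followed by an ordinary convolution, so most multiplications involve inserted zeros and can be skipped once the computation is recast into the uniform CONV pattern.

```python
# Hedged illustration of the Zero-TCONV upsampling pattern: zero insertion
# makes most of the upsampled operands zero, so the corresponding
# multiplications can be eliminated.
import numpy as np

def zero_insert(x, stride=2):
    """Insert (stride - 1) zeros between neighboring pixels of a 2-D map."""
    H, W = x.shape
    up = np.zeros((H * stride - (stride - 1), W * stride - (stride - 1)),
                  dtype=x.dtype)
    up[::stride, ::stride] = x
    return up

x = np.random.rand(8, 8).astype(np.float32)
up = zero_insert(x, stride=2)                 # 15 x 15 map, mostly zeros
zero_fraction = 1.0 - x.size / up.size        # ~72% of the pixels are zeros
print(up.shape, f"{zero_fraction:.2%} of the upsampled pixels are zeros")
```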
In summary, we have worked on FPGA acceleration for deep learning vision algorithms: several hand-coded customized accelerators, an auto-compiler that generates RTL code for customized accelerators, and an initial tool-chain for an FPGA-based overlay processor that compiles DCNN configuration files from popular deep learning frameworks and maps them onto the processor for acceleration.