UCLA Electronic Theses and Dissertations

Acceleration of Deep Learning Algorithms with Transformers

Abstract

Deep learning algorithms have become increasingly popular in our daily lives and have achieved extensive success in many applications. Starting from convolutional neural networks (CNNs), the focus of deep learning has gradually shifted to Transformer-based models, and along the way models have grown larger and more complex. Specifically, operations and tensor shapes are becoming more diverse, while model sizes and operation counts are growing rapidly. Both trends pose challenges for hardware design in handling the diversity and scalability of deep learning algorithms.

We start from a domain-specific FPGA-based overlay processor unit (OPU) and extend it to handle the dynamic shapes in Transformer-based models. We propose a reconfigurable systolic array with multi-level packing on three fronts to handle the variable length of input sequences during multi-head attention. First, matrix multiplications for different heads can be packed along the array columns to improve spatial efficiency. Meanwhile, for temporal efficiency, we develop a coarse-grained pipeline for attention, where stages can run on different parts of the array at the same time. We further exploit the computation redundancy introduced by causal masking in the Transformer decoder with runtime-reconfigurable inter-PE connections and buffer switching. Applied to GPT, our FPGA design achieves 1.16x higher normalized throughput and 1.94x better runtime MAC utilization than state-of-the-art GPU performance for variable-length input sequences from the GLUE and SQuAD datasets.
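As a rough illustration of the column-packing level, the following Python sketch packs per-head matrix multiplications side by side along the columns of a fixed-width array and reports the resulting spatial utilization. It is a behavioral model only; the HeadJob fields, the greedy first-fit policy, and the 64-column width are assumptions for illustration, not the actual OPU design.

```python
# Behavioral sketch (not RTL) of column packing: per-head matmuls whose output
# width is smaller than the systolic-array width share the array columns.
from dataclasses import dataclass

@dataclass
class HeadJob:
    head_id: int
    seq_len: int      # rows streamed through the array (variable per request)
    head_dim: int     # output columns occupied by this head's matmul

def pack_heads(jobs, array_cols=64):
    """Greedy first-fit-decreasing packing of head matmuls along the columns."""
    groups = []  # each group is a set of heads executed concurrently
    for job in sorted(jobs, key=lambda j: -j.head_dim):
        for group in groups:
            if sum(j.head_dim for j in group) + job.head_dim <= array_cols:
                group.append(job)
                break
        else:
            groups.append([job])
    return groups

def spatial_utilization(groups, array_cols=64):
    used = sum(sum(j.head_dim for j in g) for g in groups)
    return used / (len(groups) * array_cols)

if __name__ == "__main__":
    jobs = [HeadJob(h, seq_len=37, head_dim=24) for h in range(12)]
    groups = pack_heads(jobs)
    print(len(groups), "passes, utilization =", round(spatial_utilization(groups), 2))
```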

Next, we scale the processor up to a multi-core system (MCore-OPU) for larger Transformer-based models, with optimizations for intra-core computation and inter-core communication. First, we run the processing element (PE) array at twice the frequency of the rest of the processor to improve intra-core throughput. Second, we develop on-chip synchronization routers to reduce expensive off-chip memory traffic: for layer normalization and softmax, only partial sums and maxima are communicated between cores rather than entire vectors. Moreover, we pipeline synchronization to reduce synchronization latency and add a bypass of the interconnect bus to reduce off-chip memory access latency. Finally, we optimize multi-core model allocation and scheduling to minimize inter-core communication and maximize intra-core computation efficiency. MCore-OPU is implemented with four cores and four DDR channels on the Xilinx U200 FPGA, where the PE array runs at 600 MHz while the rest of the design runs at 300 MHz. Experimental results show that MCore-OPU with 8-bit integer arithmetic outperforms other FPGA-based accelerators by 2.82x-13.28x and the A100 GPU by 2.91x-4.60x in throughput per DSP for BERT, ViT, GPT-2, and LLaMA inference, respectively.
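To illustrate the scalar-only synchronization idea for softmax, the following NumPy sketch shards one row across cores and exchanges only per-core maxima and partial sums before each core finishes its shard locally. The sharding, the core count, and modeling the router as plain Python reductions are illustrative assumptions, not the MCore-OPU hardware protocol.

```python
# Each "core" holds a shard of the row; only scalars cross the modeled router.
import numpy as np

def multicore_softmax(shards):
    """shards: list of 1-D arrays, one per core, forming one logical row."""
    # Step 1: each core reports its local maximum (one scalar per core).
    global_max = max(s.max() for s in shards)
    # Step 2: each core reports the partial sum of exp(x - global_max).
    global_sum = sum(np.exp(s - global_max).sum() for s in shards)
    # Step 3: each core finishes its shard locally; no full vectors were exchanged.
    return [np.exp(s - global_max) / global_sum for s in shards]

if __name__ == "__main__":
    row = np.random.randn(1024).astype(np.float32)
    shards = np.split(row, 4)                      # 4 cores
    out = np.concatenate(multicore_softmax(shards))
    ref = np.exp(row - row.max()) / np.exp(row - row.max()).sum()
    print(np.allclose(out, ref, atol=1e-6))
```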

After that, we study how to handle the sparsity in pruned large language models (LLMs). We propose ChatOPU to support unstructured model pruning and to improve data reuse on a systolic array. First, we propose a new diagonal dataflow on the systolic array to obtain efficient data reuse for both sparse and dense matrix multiplication. Second, we develop efficient encoding and decoding of the sparse parameters to save off-chip memory traffic. Moreover, we boost off-chip bandwidth utilization with pinned on-chip KV cache allocation and coalesced accesses throughout LLM inference. Experimental results show that ChatOPU on the Xilinx U200 FPGA outperforms a GPU and other FPGA-based accelerators by 2.29x and 1.63x, respectively, on LLMs with unstructured sparsity across different input and output sequence lengths.
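The exact sparse parameter format is not spelled out in this abstract; as a stand-in, the following sketch uses a simple bitmask-plus-values encoding per 8-weight block to show how skipping zeros reduces the bytes moved off chip. The block size and layout are assumptions for illustration only, not the ChatOPU encoding.

```python
# Illustrative bitmask encoding for unstructured sparse int8 weights.
import numpy as np

def encode_sparse(w, block=8):
    """Pack each length-`block` segment as (1-byte bitmask, nonzero int8 values)."""
    flat = w.reshape(-1, block)
    masks, values = [], []
    for seg in flat:
        nz = seg != 0
        masks.append(np.packbits(nz)[0])      # one byte flags 8 weights
        values.append(seg[nz].astype(np.int8))
    return np.array(masks, dtype=np.uint8), values

def decode_sparse(masks, values, block=8):
    out = np.zeros((len(masks), block), dtype=np.int8)
    for i, (m, v) in enumerate(zip(masks, values)):
        nz = np.unpackbits(np.array([m], dtype=np.uint8))[:block].astype(bool)
        out[i, nz] = v
    return out.reshape(-1)

if __name__ == "__main__":
    w = (np.random.randn(64) * 10).astype(np.int8)
    w[np.random.rand(64) < 0.7] = 0            # ~70% unstructured sparsity
    masks, values = encode_sparse(w)
    assert np.array_equal(decode_sparse(masks, values), w)
    stored = masks.nbytes + sum(v.nbytes for v in values)
    print(f"dense {w.nbytes} B -> encoded {stored} B")
```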

We also study how to optimize our system for heterogeneous CNNs. We propose a runtime-reconfigurable heterogeneous overlay processor with three types of PEs. The first two are optimized for normal and depthwise convolutions, respectively, without runtime reconfiguration. The third can be reconfigured at runtime to either normal or depthwise convolution by fully reconfiguring the FPGA fabric for that PE. To reduce reconfiguration overhead, we develop scheduling and allocation to maximize the throughput of a mixed workload, and we also decide the optimal ratio between PE types at design time. For a workload including MobileNetV1, MobileNetV2, and ShuffleNetV1, under the same FPGA area, clock, and memory bandwidth, our design is 30% and 18% better, respectively, than overlays with homogeneous PEs and overlays with heterogeneous PEs whose PE ratio is optimized at design time for each specific network.
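The design-time choice of PE ratio can be viewed as a small design-space sweep over a workload cost model. The toy Python sketch below enumerates ratios of fixed normal-conv, fixed depthwise, and runtime-reconfigurable PEs under a made-up cost model with a flat reconfiguration penalty; all rates, budgets, and costs are hypothetical and do not reflect the model used in the dissertation.

```python
# Toy design-space sweep over (fixed normal, fixed depthwise, flexible) PE counts.
from itertools import product

TOTAL_PES = 8
RECONFIG_COST = 5.0   # time units to switch a flexible PE between conv types

def makespan(n_normal, n_dw, n_flex, work_normal, work_dw):
    # Flexible PEs are split between the two conv types, paying one
    # reconfiguration; fixed-function PEs process only their own type.
    best = float("inf")
    for flex_to_normal in range(n_flex + 1):
        flex_to_dw = n_flex - flex_to_normal
        t_normal = work_normal / max(n_normal + flex_to_normal, 1e-9)
        t_dw = work_dw / max(n_dw + flex_to_dw, 1e-9)
        best = min(best, max(t_normal, t_dw) + (RECONFIG_COST if n_flex else 0))
    return best

def best_ratio(work_normal, work_dw):
    candidates = [(n, d, TOTAL_PES - n - d)
                  for n, d in product(range(TOTAL_PES + 1), repeat=2)
                  if n + d <= TOTAL_PES]
    return min(candidates, key=lambda c: makespan(*c, work_normal, work_dw))

if __name__ == "__main__":
    # e.g. a MobileNet-heavy mix: more depthwise work than normal-conv work
    print(best_ratio(work_normal=100.0, work_dw=300.0))
```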

To summarize, we have developed FPGA-based overlay processors for CNNs and Transformer-based models. The processors are hand-coded and implemented on FPGAs. We also develop a compiler toolchain that parses deep learning models from PyTorch, applies processor-specific scheduling and allocation, and then generates code for our processors.
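As a sketch of the front end of such a toolchain, the snippet below traces a small PyTorch module with torch.fx and hands the resulting op list to placeholder scheduling and code-generation stages. Only the torch.fx tracing is a real API; TinyBlock, the pe_group assignment, and the instruction dictionary are hypothetical stand-ins for the actual compiler.

```python
# Minimal front-end sketch: PyTorch model -> traced graph -> scheduled op list.
import torch
import torch.fx as fx

class TinyBlock(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.fc = torch.nn.Linear(64, 64)
    def forward(self, x):
        return torch.relu(self.fc(x))

def parse_model(model):
    """Front end: trace the model and collect a flat list of graph ops."""
    graph_module = fx.symbolic_trace(model)
    return [(n.op, n.target) for n in graph_module.graph.nodes]

def schedule_and_allocate(ops):
    """Placeholder for processor-specific scheduling and allocation."""
    return [{"op": target, "pe_group": i % 4}
            for i, (kind, target) in enumerate(ops)
            if kind in ("call_module", "call_function")]

if __name__ == "__main__":
    ops = parse_model(TinyBlock())
    for instr in schedule_and_allocate(ops):
        print(instr)   # stand-in for code generation to the overlay ISA
```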

This item is under embargo until March 11, 2025.