Accelerating Attention Models on Hardware
- Li, Zheng
- Advisor(s): Kang, Mingu
Abstract
The attention mechanism is the key to many state-of-the-art transformer-based models in Natural Language Processing and Computer Vision. These models are pretrained on large datasets, and their size is growing rapidly. At the same time, the computation and data-movement costs and the on-chip memory demand are growing beyond the capabilities of edge devices. This thesis addresses these challenges by developing strategies to prune inconsequential attention scores efficiently and effectively. The attention score is the core of the attention mechanism in all transformer-based models: it measures the correlation between two tokens in a sequence, and a low score indicates an unimportant correlation with minimal impact on subsequent computation. Chapter 2 introduces a novel gradient-based method for finding the optimal threshold at which to prune inconsequential attention scores and thereby reduce computation cost. Building on the work of Chapter 2, Chapter 3 introduces an accelerator that features in-memory pruning of attention scores. Results show that these pruning strategies achieve high speedup and low energy consumption while maintaining accuracy across different transformer models on various benchmarks.
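To make the pruning idea concrete, the sketch below shows single-head attention in which softmax scores falling under a fixed threshold are zeroed before the value aggregation. This is a minimal illustration only: the function name, the fixed threshold `tau`, and the renormalization step are assumptions for exposition, not the thesis's gradient-based threshold search or the in-memory accelerator design.

```python
import numpy as np

def thresholded_attention(Q, K, V, tau=0.05):
    """Illustrative single-head attention with score pruning.

    Softmax scores below `tau` are zeroed, so the corresponding value
    rows contribute nothing to the output. `tau` is a fixed
    hyperparameter here; the thesis instead learns the threshold.
    """
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # raw attention logits
    probs = np.exp(scores - scores.max(axis=-1, keepdims=True))
    probs /= probs.sum(axis=-1, keepdims=True)      # softmax over keys
    pruned = np.where(probs < tau, 0.0, probs)      # drop inconsequential scores
    pruned /= pruned.sum(axis=-1, keepdims=True)    # renormalize surviving mass
    return pruned @ V

# toy usage: 4 tokens, 8-dimensional head
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((4, 8)) for _ in range(3))
out = thresholded_attention(Q, K, V)
print(out.shape)  # (4, 8)
```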