Architecture-Aware Performance Optimization: From the Foundational Math Library to Cutting-Edge Applications
- Zhai, Yujia
- Advisor(s): Chen, Zizhong
Abstract
High performance is essential for deploying a system in the real world. This thesis presents architecture-aware performance optimization techniques for applications ranging from foundational math libraries, such as Basic Linear Algebra Subprograms (BLAS), to cutting-edge applications such as homomorphic encryption (HE) and deep learning (DL) inference for transformer models.
First, we introduce FT-BLAS, a new BLAS implementation that tolerates soft errors on the fly while delivering performance superior to state-of-the-art libraries such as Intel MKL, OpenBLAS, and BLIS. Experimental results on Intel Skylake, Intel Cascade Lake, and AMD Zen2 processors show that FT-BLAS runs up to 3.50%, 22.14%, and 21.70% faster than Intel MKL, OpenBLAS, and BLIS, respectively.
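For context, a common way to tolerate soft errors in BLAS-style routines is checksum-based algorithm-based fault tolerance (ABFT). The sketch below illustrates that general idea for matrix multiplication using NumPy; the encoding and verification shown here are illustrative and are not necessarily FT-BLAS's exact scheme.

```python
# Illustrative sketch (not FT-BLAS's exact scheme): classic checksum-based
# ABFT for C = A @ B. Encode A with a column checksum and B with a row
# checksum; the checksum relations must still hold for the product, so a
# violated relation flags a soft error in the computed C.
import numpy as np

def checksum_gemm(A, B, tol=1e-8):
    m, k = A.shape
    k2, n = B.shape
    assert k == k2
    # Column-checksum-encoded A: append a row holding the column sums of A.
    A_c = np.vstack([A, A.sum(axis=0)])
    # Row-checksum-encoded B: append a column holding the row sums of B.
    B_r = np.hstack([B, B.sum(axis=1, keepdims=True)])
    C_full = A_c @ B_r                       # (m+1) x (n+1) encoded product
    C = C_full[:m, :n]
    # Verify: the appended row/column must equal the column/row sums of C.
    col_ok = np.allclose(C_full[m, :n], C.sum(axis=0), atol=tol)
    row_ok = np.allclose(C_full[:m, n], C.sum(axis=1), atol=tol)
    return C, (col_ok and row_ok)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    A = rng.standard_normal((64, 48))
    B = rng.standard_normal((48, 32))
    C, ok = checksum_gemm(A, B)
    print("checksums consistent:", ok)       # True in the absence of faults
```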
We then present XeHE, an HE library accelerated on Intel GPUs. Our staged optimizations, including low-level optimizations and kernel fusion, accelerate the Number Theoretic Transform (NTT), a fundamental algorithm in HE, by up to 9.93x over the naive GPU baseline. The optimized NTT reaches 79.8% and 85.7% of peak performance on two GPU devices, and our systematic optimizations improve the performance of encrypted element-wise polynomial matrix multiplication by up to 3.11x.
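To make the workload concrete, the sketch below shows a textbook radix-2 iterative NTT over a small prime modulus, the kind of butterfly computation the GPU kernels optimize; the modulus, root of unity, and transform size are illustrative placeholders, not the library's parameters.

```python
# Minimal sketch of a radix-2 iterative Number Theoretic Transform (NTT),
# i.e., a DFT over the integers modulo a prime p. Parameters below are toy
# values chosen only so the example runs; they are not XeHE's parameters.
def ntt(a, p, root):
    """In-place Cooley-Tukey NTT of a (length a power of two) mod p,
    where `root` is a primitive len(a)-th root of unity mod p."""
    n = len(a)
    # Bit-reversal permutation.
    j = 0
    for i in range(1, n):
        bit = n >> 1
        while j & bit:
            j ^= bit
            bit >>= 1
        j |= bit
        if i < j:
            a[i], a[j] = a[j], a[i]
    # Butterfly stages: merge transforms of size length//2 into size length.
    length = 2
    while length <= n:
        w_len = pow(root, n // length, p)    # primitive length-th root mod p
        for start in range(0, n, length):
            w = 1
            for k in range(start, start + length // 2):
                u = a[k]
                v = a[k + length // 2] * w % p
                a[k] = (u + v) % p
                a[k + length // 2] = (u - v) % p
                w = w * w_len % p
        length <<= 1
    return a

if __name__ == "__main__":
    p, n = 17, 8                             # small NTT-friendly prime (2^4 + 1)
    root = pow(3, (p - 1) // n, p)           # primitive n-th root of unity mod p
    a = [1, 2, 3, 4, 0, 0, 0, 0]
    A = ntt(a[:], p, root)
    # Inverse transform: NTT with root^-1, then scale by n^-1 mod p.
    inv = ntt(A[:], p, pow(root, -1, p))
    n_inv = pow(n, -1, p)
    print([x * n_inv % p for x in inv])      # recovers the original a
```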
Finally, we present ByteTransformer, an industrial transformer framework optimized for variable-length inputs. ByteTransformer has been deployed to serve ByteDance's TikTok and Douyin applications, and some of our proposed optimizations have been integrated into NVIDIA's production code base. Experimental results on an NVIDIA A100 GPU with variable-length sequence inputs show that our fused multi-head attention (MHA) outperforms the standard PyTorch MHA by 6.13x. On a standard BERT transformer model, ByteTransformer's end-to-end performance surpasses state-of-the-art transformer frameworks, including PyTorch JIT, TensorFlow XLA, Tencent TurboTransformer, Microsoft DeepSpeed-Inference, and NVIDIA FasterTransformer, by 87%, 131%, 138%, 74%, and 55%, respectively. We also demonstrate that our optimization methods generalize to other BERT-like models, including ALBERT, DistilBERT, and DeBERTa.
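As a reference point, the sketch below shows the padded PyTorch MHA baseline that a padding-free, fused MHA avoids: every sequence is padded to the batch maximum and the padded positions are masked, so the baseline still spends compute on them. The model dimensions and sequence lengths here are illustrative only.

```python
# A minimal sketch of the padded PyTorch multi-head attention baseline used
# for comparison with fused, padding-free MHA kernels. Hyperparameters and
# sequence lengths are illustrative placeholders.
import torch

embed_dim, num_heads, batch = 768, 12, 4
seq_lens = [13, 57, 101, 64]                  # variable-length sequences
max_len = max(seq_lens)

mha = torch.nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)

# Pad every sequence to max_len and mark the padded positions.
x = torch.zeros(batch, max_len, embed_dim)
key_padding_mask = torch.ones(batch, max_len, dtype=torch.bool)
for i, L in enumerate(seq_lens):
    x[i, :L] = torch.randn(L, embed_dim)
    key_padding_mask[i, :L] = False           # False = real token, True = pad

# The padded baseline computes attention over the full (batch, max_len) grid;
# a padding-free fused kernel avoids that wasted work on pad tokens.
out, _ = mha(x, x, x, key_padding_mask=key_padding_mask)
print(out.shape)                              # torch.Size([4, 101, 768])
```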