Performance Modeling and Optimization for Machine Learning Workloads

Abstract

Machine learning (ML) workloads have emerged and evolved drastically in many respects in recent years. The performance of ML workloads, i.e., training/inference speed on various devices and platforms, stands as one of the top considerations in their development. Performance modeling is a powerful technique that helps ML practitioners understand the performance bottlenecks of ML workloads and optimize them. In this dissertation, we showcase how performance models can assist in the optimization of ML performance, and how we design such models to be highly accurate, robust, and versatile across application configurations such as training/inference, ML model types, and device types.

We first show how to use the roofline model as a simple operator (op) level performance model to identify kernel/layer fusion candidates in convolutional neural networks (CNNs). We answer the question of when and why fusing two linearly connected complex ops, i.e., convolution (conv) and depthwise convolution (dw-conv), in an ML model is beneficial in terms of execution time, and propose a deep learning (DL) compiler-friendly solution that enables efficient auto-tuning of the fused two-layer kernel schedule on multicore CPUs and beats the separate-kernel execution performance of TVM (by 1.09x geomean and 1.29x max) and MKLDNN-backed PyTorch (by 2.09x geomean and 3.35x max) as end-to-end (E2E) baselines.
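
To illustrate the reasoning behind roofline-based fusion screening, the sketch below (not the dissertation's implementation; all peak rates, byte counts, and FLOP counts are made-up placeholders) compares the roofline lower bounds of running two linearly connected ops separately, where the intermediate tensor makes a round trip to memory, versus fused, where it is assumed to stay on-chip.

```python
# A minimal sketch of a roofline-style fusion-candidate check.
# Assumption: fusing removes the intermediate tensor's write + re-read,
# and the fused kernel's FLOPs are simply the sum of the two ops' FLOPs.

def roofline_time(flops, bytes_moved, peak_flops, peak_bw):
    """Lower-bound execution time (seconds) under the roofline model."""
    return max(flops / peak_flops, bytes_moved / peak_bw)

def fusion_saves_time(op_a, op_b, intermediate_bytes, peak_flops, peak_bw):
    """op_a, op_b: dicts with 'flops' and 'bytes' (excluding the shared intermediate)."""
    t_separate = (
        roofline_time(op_a["flops"], op_a["bytes"] + intermediate_bytes, peak_flops, peak_bw)
        + roofline_time(op_b["flops"], op_b["bytes"] + intermediate_bytes, peak_flops, peak_bw)
    )
    t_fused = roofline_time(
        op_a["flops"] + op_b["flops"], op_a["bytes"] + op_b["bytes"], peak_flops, peak_bw
    )
    return t_fused < t_separate, t_separate, t_fused

if __name__ == "__main__":
    # Hypothetical conv followed by dw-conv on a multicore CPU.
    conv = {"flops": 2.0e9, "bytes": 30e6}
    dw_conv = {"flops": 0.1e9, "bytes": 20e6}
    beneficial, t_sep, t_fus = fusion_saves_time(
        conv, dw_conv, intermediate_bytes=25e6, peak_flops=1.5e12, peak_bw=80e9
    )
    print(f"fuse? {beneficial}  separate={t_sep*1e3:.2f} ms  fused={t_fus*1e3:.2f} ms")
```

Under these assumptions, fusion pays off exactly when the traffic saved on the intermediate tensor moves at least one of the ops off the memory-bound region of the roofline faster than the combined compute bound grows.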

Next, we present a more sophisticated application of performance models in predicting and aiding the optimization of ML training performance on GPU platforms. Built on top of a series of kernel-level performance models, either ML-based or analytical, for the dominant ops/kernels, as well as an overhead analysis for all ops in the deep learning recommendation model (DLRM), we devise a critical-path-based performance model that not only predicts the per-batch training time of DLRM on a single GPU with a low error rate (geomean: 4.61% for GPU active time, 7.96% for E2E, and 10.15% for E2E with shared overheads) but also generalizes to other types of ML models such as computer vision (CV) and natural language processing (NLP) models.
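
The sketch below shows the general shape of a critical-path-based per-batch time estimate, not the dissertation's algorithm: each op contributes a predicted kernel time (from a hypothetical analytical or ML-based sub-model) plus a per-op overhead, and the batch time follows the longest dependency chain through the op graph. All op names, times, and overheads are invented for illustration.

```python
# A minimal sketch of critical-path-based per-batch training time prediction.
from collections import defaultdict

def critical_path_time(ops, deps, kernel_time, overhead):
    """ops: op ids in topological order.
    deps: dict op -> list of predecessor ops.
    kernel_time, overhead: dicts op -> predicted seconds (assumed sub-models).
    Returns the predicted per-batch time (finish time of the last op)."""
    finish = defaultdict(float)
    for op in ops:
        start = max((finish[p] for p in deps.get(op, [])), default=0.0)
        finish[op] = start + overhead[op] + kernel_time[op]
    return max(finish.values())

# Toy DLRM-like graph: embedding lookup and bottom MLP run in parallel,
# then feature interaction, then the top MLP (all numbers are made up).
ops = ["emb", "bot_mlp", "interact", "top_mlp"]
deps = {"interact": ["emb", "bot_mlp"], "top_mlp": ["interact"]}
kernel_time = {"emb": 1.2e-3, "bot_mlp": 0.8e-3, "interact": 0.3e-3, "top_mlp": 0.9e-3}
overhead = {op: 5e-6 for op in ops}
print(f"predicted per-batch time: {critical_path_time(ops, deps, kernel_time, overhead)*1e3:.3f} ms")
```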

Finally, we further extend this performance model to multi-GPU platforms by adding support for 1) performance modeling of communication collectives, 2) GPU stream synchronizations on the same device and across devices in the E2E time prediction algorithm, and 3) data-distribution-aware and problem-size-flexible performance modeling of embedding table lookup. On single-node multi-GPU platforms, this enhanced model is robust to DLRM models with random embedding tables, maintains a low training-speed prediction error (geomean: 5.21% for E2E with shared overheads on randomly generated DLRMs), and generalizes well to NLP models with 3.00% geomean prediction error. With a use case, we demonstrate its ability to quickly select an embedding table sharding configuration and thus improve the end-to-end training performance of DLRMs.
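
As a rough illustration of how a lookup-cost model can drive sharding selection, the sketch below (hypothetical costs and a generic longest-processing-time greedy heuristic, not the dissertation's method) places each embedding table on the GPU with the lowest predicted load, since the slowest device bounds the per-batch time.

```python
# A minimal sketch of choosing an embedding-table sharding from a
# per-table predicted lookup cost (all cost values are made up).

def shard_tables(table_costs, num_gpus):
    """table_costs: dict table_name -> predicted per-batch lookup time (s).
    Returns (assignment dict table -> gpu index, predicted max per-GPU time)."""
    loads = [0.0] * num_gpus
    assignment = {}
    # Largest predicted cost first, then greedy placement (LPT heuristic).
    for table, cost in sorted(table_costs.items(), key=lambda kv: -kv[1]):
        gpu = min(range(num_gpus), key=lambda g: loads[g])
        assignment[table] = gpu
        loads[gpu] += cost
    return assignment, max(loads)

costs = {"t0": 2.1e-3, "t1": 1.4e-3, "t2": 0.9e-3, "t3": 0.7e-3, "t4": 0.5e-3}
assignment, bottleneck = shard_tables(costs, num_gpus=2)
print(assignment, f"predicted bottleneck: {bottleneck*1e3:.2f} ms")
```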
