Dhakal, Aditya

Cooperative Design of Machine Learning and GPU-Based Systems for Inference

2022

Advisor(s): Ramakrishnan, K. K.

Abstract

Our work seeks to improve and adapt computing systems and machine learning (ML) algorithms to match each other’s requirements and capabilities. Because Deep Neural Networks (DNNs) have high computational demand, accelerators such as GPUs are necessary to achieve low-latency inference. As GPUs grow more powerful, however, DNNs often fail to fully utilize a GPU’s parallelism. Understanding the GPU resources a DNN model can effectively use while still meeting users’ service-level objectives (SLOs) allows us to run multiple DNN models concurrently by spatially sharing the GPU. Our DNN inference framework virtualizes the GPU, adapts to the DNN model, and improves CPU-GPU coordination to achieve a much higher aggregate inference throughput than other multiplexing techniques. While spatial sharing utilizes the GPU’s resources better, its static resource allocation is unsuitable for dynamic workloads. We therefore propose a spatio-temporal scheduler that provides the right GPU resources for multiple DNNs and meets the inference tasks’ SLOs. Our spatio-temporal scheduler can run more models concurrently and achieve 4× higher throughput than other scheduling techniques.

Autotuning a DNN model customizes it to match the system’s capabilities. However, current autotuning frameworks do not consider a DNN model’s GPU demand. As a result, they produce a sub-optimally tuned model that does not provide the lowest possible latency when inference runs with fewer GPU resources. Our enhanced autotuning targets the appropriate GPU resources during tuning, producing a tuned model that is resilient to different amounts of GPU resources available at inference. Our framework enables concurrent autotuning of multiple models by spatially multiplexing the GPU and eliminates several system overheads, decreasing tuning time by more than 70%.

We apply our understanding of GPUs to the task of localization in multi-user augmented reality (AR) applications. Our multi-user AR framework efficiently offloads AR computation to an edge server and lowers the localization time by 50%. Our implementation uses shared memory on the edge server to reduce the time for “map merging” between different users’ maps, a task that must be performed frequently in multi-user AR. Our approach cuts the map merging time by more than 80% compared to potential multi-user AR approaches.

Our overall approach is to adapt computing resources to the algorithms that use them. It particularly benefits current ML algorithms and other applications that use these accelerators, such as AR. Our techniques can also improve the throughput of the newer generations of accelerators, which offer a significant speedup by using parallel compute engines.

UC Riverside
