Orchestration Systems to Support Deep Learning at Scale
- Nagrecha, Kabir
- Advisor(s): Kumar, Arun; Zhang, Hao
Abstract
The dramatic rise in popularity of deep learning (DL) across the domain sciences and industry has been accompanied by a correspondingly aggressive increase in the scale and computational complexity of DL workloads. To adopt state-of-the-art techniques, practitioners must wrestle with systems challenges of performance, cost, and scalability. In this dissertation, we identify the need for orchestration systems, which ease scaling burdens across the DL lifecycle through holistic, workload-aware optimizations. Drawing on both established techniques from data management research and new bespoke algorithms, we build practical orchestration engines to optimize three common DL workloads in the large-scale setting: model selection, data processing, and high-throughput serving. Our systems, which exploit workload- and context-specific opportunities, address a new layer of the large-scale DL optimization stack: more granular than current cluster managers and data systems, but still abstracted away from low-level kernel and compiler optimizations. Empirical evaluations show that our orchestration techniques and systems can accelerate large-scale DL workloads by a large margin, even in complex, real-world settings. Our approach introduces a new technical lens that unifies systems, databases, and DL research, with the ultimate aim of democratizing and amplifying state-of-the-art DL innovations. Some of the systems proposed in this dissertation have already been adopted in production-scale industry pipelines, demonstrating the value of such orchestration optimizers for real-world DL.