Cerebro - An Efficient, End-to-End Platform for Scalable Deep Learning
eScholarship
Open Access Publications from the University of California

UC San Diego

UC San Diego Electronic Theses and Dissertations

Abstract

Deep Learning (DL) has emerged as a powerful tool for solving complex problems in various domains, including natural language processing and computer vision. DL, being an empirical process, requires tuning hyperparameters and exploring neural network architectures, which involves significant compute resources, storage, memory, time, and human effort. While tools exist to address the challenges associated with large datasets or with large DL models, there is a notable scarcity of comprehensive solutions that efficiently handle both large-scale models and large-scale datasets. The advent of Transformers and Large Language Models (LLMs) has underlined these problems and made overcoming them ever more significant. Unlike for big tech companies, these issues are particularly acute for small-scale companies and individuals. There is a need to democratize large-scale DL. In response, we propose a novel end-to-end platform, Cerebro, that provides efficient scaling of DL in a cluster, regardless of the size of the datasets or models. Our platform can preprocess data, train, validate, and test models, and visualize results, all under one roof. Cerebro achieves this through its novel scheduler, which is a hybrid of task, data, and model parallelism. Our design supports fault tolerance and cluster resource heterogeneity. By implementing Cerebro’s user-friendly templates, users can scale DL with little effort and work with the same familiarity as on their local machines. To evaluate our platform, we conducted experiments on various DL tasks, including image captioning and object detection. The experiments demonstrated that our platform provides efficient scaling of DL workloads, significantly reducing the time, effort, and resource costs required for large-scale model selection. This thesis describes the methods and approaches taken in the design and development of the Cerebro platform. We also discuss in detail our experimental observations of Cerebro in action and outline the directions this work can take in the future.
