Deep Learning has become one of the most important tools in computer science over the last decade because of its ability to perform tasks that involve complex reasoning. Researchers have pushed the state of the art forward in many ways, but we focus on two major trends that have shown promise. The first is models whose computation adapts to their inputs to better learn patterns in the data. For example, with Graph Neural Networks (GNNs), training the same model over two different graphs produces models specialized to the structure of each particular graph; as a result, the same model architecture can have very different computation patterns depending on the input graph. The second is the rapidly increasing size of models, measured by their number of parameters. This growth yields much more general and accurate models, but requires a large amount of hardware to train efficiently. Together, these trends mean that state-of-the-art models demand massive computational and financial resources, limiting experimentation to large, well-funded companies.
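To make concrete how the input graph, rather than the architecture, dictates the computation, consider the minimal sketch below of a single message-passing layer. The layer and the two toy graphs are illustrative inventions, not any specific published model: the point is only that the gather pattern follows the edge list.

```python
import numpy as np

# Toy single GNN layer: each node averages its neighbors' features.
# The gather pattern (and hence the compute profile) is dictated by
# the edge structure of the input graph, not by the weights.
def gnn_layer(features, adjacency, weight):
    out = np.zeros_like(features @ weight)
    for node, neighbors in adjacency.items():
        if neighbors:
            agg = features[neighbors].mean(axis=0)   # gather over this node's neighbors
            out[node] = np.maximum(agg @ weight, 0)  # linear transform + ReLU
    return out

features = np.random.rand(4, 8)
weight = np.random.rand(8, 8)
# Same node count, same weights, completely different access patterns:
chain = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
star  = {0: [1, 2, 3], 1: [0], 2: [0], 3: [0]}
h_chain = gnn_layer(features, chain, weight)
h_star  = gnn_layer(features, star, weight)
```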
Lowering the financial barrier to entry for experimenting with these models would allow many more small businesses and research labs to work with them. The current machine learning paradigm responds to growing resource requirements by throwing more high-end hardware at the problem. A key observation for addressing this problem, however, lies in the increasingly heterogeneous offerings of cloud providers. Cloud platforms offer a far more diverse set of resources than is available to most on-site clusters, allowing a more fine-grained approach to the problem. By combining these heterogeneous offerings with in-depth knowledge of the compute profiles of deep learning models, we can achieve better value than most existing systems, where value is defined as the performance achieved per dollar spent.
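A small worked example makes the value metric concrete. The prices and throughputs below are hypothetical, chosen only for illustration; the point is that a cheaper configuration can deliver more value even at lower raw performance.

```python
# Value = performance per dollar. With throughput in samples/s and
# price in $/hour, value comes out in samples per dollar.
def value(throughput_samples_per_sec, price_dollars_per_hour):
    return throughput_samples_per_sec * 3600 / price_dollars_per_hour

on_demand = value(100, 3.00)  # hypothetical full-priced GPU instance
cheap     = value(60, 1.00)   # hypothetical cheaper, slower configuration
print(cheap / on_demand)      # ~1.8x the value despite 40% less raw throughput
```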
My first system, Dorylus, a Graph Neural Network training framework, provides up to 3.86x more value than its GPU-based variant on large graphs by employing serverless threads to perform computation asynchronously. My second system, Bamboo, reduces the cost of training massive pipeline-parallel models by making training on preemptible instances both resilient and performant, intelligently introducing redundant computation into the pipelines. Bamboo provides almost 2x the performance-per-dollar of training on full-priced, non-preemptible instances, and 1.5x the performance-per-dollar of existing systems designed to run on spot instances.
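The sketch below conveys the asynchronous-dispatch idea behind Dorylus in miniature; it is not Dorylus's actual interface. Here, tensor_task stands in for the dense per-batch work a serverless worker would run, and a thread pool stands in for the fleet of serverless invocations.

```python
from concurrent.futures import ThreadPoolExecutor

def tensor_task(batch):
    # Placeholder for the dense compute a serverless worker would run
    # (e.g. weight multiplication + activation for one vertex batch).
    return [x * 2 for x in batch]

def train_epoch(vertex_batches, pool):
    # Fire off tensor work without blocking: the graph side keeps
    # producing batches while earlier batches are still in flight,
    # hiding the latency of the serverless invocations.
    futures = [pool.submit(tensor_task, b) for b in vertex_batches]
    return [f.result() for f in futures]

with ThreadPoolExecutor(max_workers=8) as pool:
    results = train_epoch([[1, 2], [3, 4], [5, 6]], pool)
```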
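Bamboo's redundancy can be sketched in a similar spirit. In this illustrative (not Bamboo's real API) version, each pipeline worker holds a shadow copy of its successor's stage, so a preempted neighbor's work can be covered without stalling the pipeline.

```python
# Hypothetical sketch of redundant pipeline stages; names are illustrative.
class Worker:
    def __init__(self, stage_id, own_stage, successor_stage):
        self.stage_id = stage_id
        self.own_stage = own_stage     # layers this worker normally runs
        self.shadow = successor_stage  # redundant copy of the next stage

    def forward(self, x, successor_alive):
        x = self.own_stage(x)
        if not successor_alive:
            # Redundant computation: cover for the preempted neighbor
            # until a replacement spot instance rejoins the pipeline.
            x = self.shadow(x)
        return x

stage0 = lambda x: x + 1
stage1 = lambda x: x * 2
w0 = Worker(0, stage0, stage1)
out = w0.forward(3, successor_alive=False)  # (3 + 1) * 2 = 8
```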