Deep Learning Performance Optimization via Model Parallelization
- Albaqsami, Ahmad
- Advisor(s): Bagherzadeh, Nader
Abstract
In recent years, machine learning (ML) and, more notably, deep learning (DL) have become increasingly ubiquitous. Applications of these technologies appear in many fields, including health care, manufacturing, and end-consumer services. In terms of deployment, deep neural networks (DNNs) are found in consumer devices, small internet-of-things devices, embedded in vehicles, and at large scale in data centers and servers. The trend indicates that the use of DL in smart applications will continue to grow in the coming years.
As the name suggests, learning is an integral part of the functionality of DNNs, whether it takes place off-line before deployment or in real time while the DNN carries out its assigned task. As part of the learning process, training is required to set the parameters, also known as weights, of the DNN so that it achieves high accuracy on its assigned task; without training, the parameters are not set correctly and the DNN is effectively useless. It has been shown that this training process requires large amounts of data and a high number of training iterations before the DNN model becomes effective. In each iteration, the weights are updated based on the subset of the training data provided. Training has therefore proven to be a challenge because of the long times involved: the amount of training data, the number of weights, and the computational complexity of updating those weights all contribute to this challenge. One way to reduce training time is to allocate processes across multiple processors, achieving at least a sub-optimal form of parallelism. One approach is to have this mapping decision made by ML or DL experts. The problem is the absence of concrete information on which to base the decision: the time a particular process takes on a particular processor, and the cost of communication between processors, are in fact unknown. Even with the intuition of a domain expert, a mapping that outperforms the single-processor case is often not achieved.
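To make the mapping decision concrete, the sketch below pins two layers of a toy network to different devices using TensorFlow's `tf.device` context. The device strings, layer sizes, and the particular CPU/GPU split are illustrative assumptions only, not the placements studied in this dissertation.

```python
import tensorflow as tf

# Allow TensorFlow to fall back gracefully if a named device is absent.
tf.config.set_soft_device_placement(True)

x = tf.random.normal([32, 1024])           # a batch of input activations (toy sizes)

with tf.device("/CPU:0"):                   # first layer manually assigned to the CPU
    w1 = tf.Variable(tf.random.normal([1024, 512]))
    h1 = tf.nn.relu(tf.matmul(x, w1))

with tf.device("/GPU:0"):                   # second layer assigned to a GPU, if one exists
    w2 = tf.Variable(tf.random.normal([512, 10]))
    logits = tf.matmul(h1, w2)

print(logits.shape)                         # (32, 10)
```

Each `tf.device` block is one point in the process-to-processor design space; the framework described below searches over such assignments automatically rather than relying on an expert to choose them.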
In this dissertation, a hybrid, multi-step optimization framework is presented. The framework explores the vast design space of mappings from processes to processors, with the search and evaluation conducted in real time while the DNN is training. In the first stage of the framework, we compare an intuition-based algorithmic approach with a Bayesian optimization (BO) approach. In the second stage, we build a predictive function for the performance of a single training iteration, comparing the accuracy of predictive functions produced by different ML algorithms. The resulting predictive model is then used as a surrogate function when searching for the best mapping; this stage of the search applies a genetic algorithm (GA). An adaptive feature is also presented and tested for responsiveness to changes in the system that affect training performance.
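A minimal sketch of the second-stage idea follows, assuming a function `predict_iteration_time` that stands in for the learned surrogate of per-iteration training time. The dummy cost function, GA operators, and parameter values are generic placeholders, not the ones used in the framework.

```python
import random

NUM_OPS, NUM_DEVICES = 20, 4                 # placeholder problem size

def predict_iteration_time(mapping):
    """Stand-in surrogate: a real implementation would query the predictive
    model trained on measured iteration times (assumption, not the actual model)."""
    return sum(1.0 for op, dev in enumerate(mapping) if op % NUM_DEVICES != dev)

def random_mapping():
    return [random.randrange(NUM_DEVICES) for _ in range(NUM_OPS)]

def crossover(a, b):
    cut = random.randrange(1, NUM_OPS)       # single-point crossover of two mappings
    return a[:cut] + b[cut:]

def mutate(m, rate=0.05):
    return [random.randrange(NUM_DEVICES) if random.random() < rate else d for d in m]

def genetic_search(pop_size=50, generations=100):
    population = [random_mapping() for _ in range(pop_size)]
    for _ in range(generations):
        # Rank candidates by the surrogate instead of running real training steps.
        population.sort(key=predict_iteration_time)
        parents = population[: pop_size // 2]
        children = [mutate(crossover(random.choice(parents), random.choice(parents)))
                    for _ in range(pop_size - len(parents))]
        population = parents + children
    return min(population, key=predict_iteration_time)

best = genetic_search()
print(predict_iteration_time(best))
```

The point of the surrogate is that candidate mappings can be scored without paying the wall-clock cost of actually running a training iteration for each one.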
We also present heterogeneous earliest finish time (HEFT), a deterministic approach to mapping. In addition, we present the concept of node splitting, in which the computational graph of the DNN is split in order to accommodate a higher degree of model parallelism. We note that splitting also affects the accuracy of the DNN, since the hyperparameters are affected.
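For reference, the listing below sketches a simplified HEFT schedule on a toy task graph: tasks are ranked by upward rank (average computation cost plus the heaviest communication-plus-rank path to an exit task) and then assigned, in rank order, to the processor giving the earliest finish time. The cost numbers are invented for illustration, and the sketch omits the insertion-based slot search of the full algorithm.

```python
# comp[t][p] = computation cost of task t on processor p (toy values)
# comm[(u, v)] = data-transfer cost on edge u -> v (zero if both on the same processor)
comp = {"a": [4, 6], "b": [3, 5], "c": [6, 4], "d": [5, 5]}
comm = {("a", "b"): 2, ("a", "c"): 3, ("b", "d"): 2, ("c", "d"): 1}
succ = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": []}
procs = [0, 1]

def avg(costs):
    return sum(costs) / len(costs)

rank_memo = {}
def upward_rank(t):
    """Average compute cost plus the heaviest (comm + rank) path to an exit task."""
    if t not in rank_memo:
        tail = max((comm[(t, s)] + upward_rank(s) for s in succ[t]), default=0)
        rank_memo[t] = avg(comp[t]) + tail
    return rank_memo[t]

order = sorted(comp, key=upward_rank, reverse=True)   # schedule high-rank tasks first
pred = {t: [u for u in succ if t in succ[u]] for t in comp}
finish, where, ready = {}, {}, {p: 0 for p in procs}

for t in order:
    best = None
    for p in procs:
        # Earliest time all predecessor data can arrive at processor p.
        data_ready = max((finish[u] + (0 if where[u] == p else comm[(u, t)])
                          for u in pred[t]), default=0)
        eft = max(ready[p], data_ready) + comp[t][p]
        if best is None or eft < best[0]:
            best = (eft, p)
    finish[t], where[t] = best
    ready[best[1]] = best[0]

print(where)    # {'a': 0, 'b': 0, 'c': 1, 'd': 1} with these toy costs
print(finish)   # completion time of each task
```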
The framework and methodologies were evaluated in real, non-simulated systems using wall-clock time. The DNNs were built using Google's ML/DL library, TensorFlow (TF).
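As a rough illustration of collecting a wall-clock sample per training iteration with TensorFlow's Python API, the following sketch times a toy model; the model, data, and step count are placeholder assumptions rather than the experimental setup used here.

```python
import time
import tensorflow as tf

# Toy model and synthetic batch, used only to show the timing pattern.
model = tf.keras.Sequential([tf.keras.layers.Dense(128, activation="relu"),
                             tf.keras.layers.Dense(10)])
loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
opt = tf.keras.optimizers.SGD(0.01)
x = tf.random.normal([64, 784])
y = tf.random.uniform([64], maxval=10, dtype=tf.int32)

def train_step():
    with tf.GradientTape() as tape:
        loss = loss_fn(y, model(x))
    grads = tape.gradient(loss, model.trainable_variables)
    opt.apply_gradients(zip(grads, model.trainable_variables))

train_step()                                # warm-up step, excludes one-time build cost
t0 = time.perf_counter()
for _ in range(100):
    train_step()
print("mean iteration time:", (time.perf_counter() - t0) / 100, "s")
```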