Deep Neural Networks (DNNs) enable computers to excel across many applications, such as image classification, speech recognition, and robotic control. To accelerate DNN training and serving, parallel computing is widely adopted. However, system efficiency becomes a major concern when scaling out. High communication overhead and limited on-device memory are two major causes of system inefficiency in distributed machine learning.
This dissertation studies possible ways to mitigate communication bottlenecks and achieve better on-device memory utilization in data and model parallelism for distributed machine learning workloads.
On the communication side, our Blink project mitigates communication bottlenecks in data-parallel training. By packing spanning trees rather than forming rings, Blink achieves higher flexibility in arbitrary networking environments and provides near-optimal network throughput. To eliminate communication in model-parallel training and inference, we move up from the system layer to the application layer. Our sensAI project decouples a multi-task model into disconnected subnets, where each subnet is responsible for decision making for a single task or a subset of the original task set.
Towards better utilization of on-device memory, our Wavelet project intentionally adds task-launch latency to interleave the peak memory usage of different waves of training tasks on the accelerators. By packing multiple training waves onto the same accelerator, it improves both computation and on-device memory utilization.