Deep Neural Networks (DNNs) are increasingly adopted across a wide range of fields due to their unprecedented performance, yet the computation overhead of DNN training and evaluation continues to grow exponentially. Substantial system-level advances have recently improved the efficiency of DNN computation, and many efficient DNN algorithms have been proposed to reduce its cost. However, the current state is far from optimal: most DNN computation systems do not fully exploit efficient ML algorithms, while ML algorithms rarely account for the novel systems on which they are deployed.
This thesis focuses on designing efficient DNN algorithms and DNN computation systems that enable fast DNN training and evaluation. Rather than creating new DNN algorithms or new systems in isolation, we propose a co-design approach that explores system-aware DNN models together with algorithm-aware DNN computation systems. By leveraging this co-design approach, we effectively advance the Pareto frontier between task accuracy and execution efficiency across a variety of application domains.
We present a set of works that explore this co-design method. First, we present an algorithm-aware DNN compiler for quantized DNNs. By leveraging the weight-repetition property of this efficient DNN algorithm, we greatly reduce the computation overhead of DNN inference on both CPUs and GPUs (a sketch of this idea is given below). Second, we discuss a hardware-aware DNN algorithm with enhanced model parallelism. We observe that prior works design efficient DNNs for single-device platforms, and such models can hardly be parallelized across multiple devices; by customizing the DNN design for a multi-device system, we reduce DNN inference latency by a large margin. Third, we present a hardware-friendly transfer learning framework for natural language processing tasks. Existing transfer learning frameworks incur substantial computation redundancy when deployed on current systems; by reusing computation across different transfer learning models, we again greatly reduce the computation overhead. Lastly, we introduce a novel training method that reduces the computation cost of both DNN training and the DNN design process. The key idea is to initialize large models using small pretrained weights: the implicit knowledge in the pretrained models helps the large models converge faster. Because only the initialization phase of training changes, no extra computation overhead is introduced into existing training systems. This training method can also be applied to accelerate the design process of system-aware DNN models.
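To make the weight-repetition idea behind the algorithm-aware compiler concrete, the following minimal sketch shows how a dot product with quantized weights can avoid one multiplication per weight: since the weights take only a few distinct values (here, four levels as in 2-bit quantization), activations sharing the same weight value are summed first, leaving one multiplication per distinct level. The function name, the 2-bit setting, and the NumPy formulation are illustrative assumptions for exposition, not the compiler's actual implementation.

import numpy as np

def dot_weight_repetition(q_indices, levels, activations):
    """Dot product with quantized weights, exploiting weight repetition.

    q_indices[i] selects which quantization level weight i uses
    (hypothetical layout for illustration only).
    """
    # Accumulate the activations that share each quantized weight value.
    partial_sums = np.zeros(len(levels))
    np.add.at(partial_sums, q_indices, activations)
    # One multiplication per distinct level instead of one per weight.
    return float(np.dot(levels, partial_sums))

# Example: 2-bit weights (4 levels) over 8 activations.
levels = np.array([-1.0, -0.5, 0.5, 1.0])
q_idx = np.array([0, 3, 3, 1, 2, 0, 3, 2])
acts = np.random.rand(8)
# Matches the plain dense dot product with the dequantized weights.
assert np.isclose(dot_weight_repetition(q_idx, levels, acts),
                  np.dot(levels[q_idx], acts))

In this toy setting the savings are small, but for low-bit quantization over long vectors the number of multiplications drops from the vector length to the number of quantization levels, which is the kind of structure the algorithm-aware compiler exploits.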
As Moore's Law is slowing down, the computational capacity of current DNN systems is plateauing. This thesis sheds light on how to overcome this limitation by designing domain-specific DNN algorithms and computation systems.