Significant improvements in the accuracy of Neural Networks (NNs) have been observed for a wide range of problems, often achieved by highly over-parameterized models. Despite the accuracy of these state-of-the-art models, their sheer size makes it impossible to deploy them in many resource-constrained applications, such as real-time intelligent healthcare monitoring, autonomous driving, audio analysis, and speech recognition. This poses a challenge for realizing pervasive deep learning, which requires real-time inference with low energy consumption and high accuracy under limited computational resources.
Achieving efficient NNs that meet real-time constraints with optimal accuracy requires the co-optimization of 1) NN architecture design, 2) model compression methods, and 3) the design of hardware engines. Previous work on efficient deep learning has focused largely on optimizing proxy metrics such as memory size and FLOPs, even though the hardware specifications play an important role in determining the overall performance. Furthermore, due to the extremely large design space, the aforementioned three aspects are often optimized separately and empirically in previous literature, making the whole design process time-consuming and sub-optimal.
In this dissertation, we first systematically studied quantization, a widely used and standard model compression technique. Instead of relying on heuristic designs or costly search, we tackled the mixed-precision quantization problem by leveraging second-order (Hessian) information, and our proposed Hessian AWare Quantization (HAWQ) method achieved state-of-the-art performance on different networks and datasets. We further made the whole pipeline fully automatic (HAWQV2), extended quantization to the data-free setting (ZeroQ), and applied it to natural language processing tasks (Q-BERT).
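To give a flavor of the Hessian-based sensitivity analysis, the sketch below shows a matrix-free Hutchinson estimator of the Hessian trace, the kind of quantity a HAWQV2-style analysis uses to rank layers by sensitivity; the function name, sample count, and the per-parameter normalization in the usage comment are illustrative choices, not the dissertation's exact implementation.

```python
import torch

def layer_hessian_trace(loss, params, n_samples=50):
    """Estimate the Hessian trace of `loss` w.r.t. `params` with
    Hutchinson's method: Tr(H) = E_v[v^T H v], v ~ Rademacher."""
    grads = torch.autograd.grad(loss, params, create_graph=True)
    trace = 0.0
    for _ in range(n_samples):
        # Rademacher probe vectors (+1 / -1 with equal probability)
        vs = [torch.randint_like(p, high=2) * 2 - 1 for p in params]
        # Hessian-vector product via double backprop: grad(g . v) = H v
        hvs = torch.autograd.grad(grads, params, grad_outputs=vs,
                                  retain_graph=True)
        trace += sum((v * hv).sum().item() for v, hv in zip(vs, hvs))
    return trace / n_samples

# Example usage: rank layers by average trace per parameter.
# loss = criterion(model(x), y)
# sens = {name: layer_hessian_trace(loss, [p]) / p.numel()
#         for name, p in model.named_parameters() if p.dim() > 1}
```

Layers with a larger average trace are more sensitive to weight perturbation and are therefore kept at higher bit-width, while insensitive layers can be quantized more aggressively.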
Based on our systematic quantization method, we then brought hardware specifications and deployment into the design space (HAWQV3). The neural architecture was also taken into the co-design loop (CoDeNet) and searched automatically (HAO). Finally, we increased the efficiency of the whole automatic HW-SW co-design pipeline by introducing teacher-based block-wise distillation (ETA), whose per-block objective is sketched below. Overall, our work in this dissertation demonstrates steps in the evolution from traditional NN design toward hardware-aware efficient deep learning. We believe this will further accelerate the deployment of advanced NNs on resource-limited devices and in real-world applications.
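As an illustration of the block-wise idea, the following minimal PyTorch sketch trains one student block to regress onto the corresponding teacher block's outputs; `distill_block`, the activation loader, and the hyperparameters are hypothetical placeholders rather than ETA's actual training code.

```python
import torch
import torch.nn as nn

def distill_block(teacher_block, student_block, act_loader,
                  epochs=1, lr=1e-3):
    """Fit one student block to reproduce the matching teacher block.
    `act_loader` yields input activations recorded at this block's
    boundary in the teacher; the regression target is the teacher
    block's own output on the same input."""
    teacher_block.eval()
    opt = torch.optim.Adam(student_block.parameters(), lr=lr)
    mse = nn.MSELoss()
    for _ in range(epochs):
        for x in act_loader:
            with torch.no_grad():
                target = teacher_block(x)  # teacher output = soft label
            loss = mse(student_block(x), target)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return student_block
```

Because each block only needs the teacher's activations at its own boundary, blocks can be trained independently and in parallel, which shortens the evaluation of candidate designs inside a co-design search loop.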