Deep neural networks have demonstrated outstanding performance in many areas of machine learning, such as computer vision, speech recognition, and natural language processing. In particular, convolutional neural networks (CNNs) perform well in computer vision tasks such as image recognition, object detection, and image segmentation. The abundance of training data, advances in computing hardware, and the use of graphical processing units (GPUs) have made the training and deployment of deep CNN models feasible. CNN models usually consist of many layers containing millions or even hundreds of millions of trainable parameters. This large number of parameters demands substantial storage and extensive computation, which makes deploying these models challenging on energy-constrained devices such as mobile devices, Internet of Things (IoT) nodes, CPU-based robotics platforms, and autonomous vehicles. A plethora of software and hardware methods has been introduced over the last five years to compress state-of-the-art deep neural network models for deployment at the edge.
This dissertation investigates the use and combination of pruning, quantization, and tensor decomposition methods on state-of-the-art deep neural network models. We compare the combined methods with each method applied individually in terms of storage and computation cost. Furthermore, we explore and improve the accuracy of these methods using various ensemble techniques and different training routines.
To this end, we first propose the FPTT method, which combines a pruning method with a tensor decomposition method. It reduces the number of parameters by 98% for some models and achieves a compression factor of 30.7× for others. Next, we use the ultimate compression method, which combines tensor decomposition with a binary neural network, achieving a compression ratio of 169.1× for some state-of-the-art models. We also present a method for improving inference accuracy under a logarithmic weight representation: instead of rounding the weights deterministically, we quantize them multiple times with stochastic rounding and average the results. This method achieves the same accuracy as floating-point models while reducing computation and storage costs. We further improve binary neural network models using ensemble methods and filter sharing, which reduce storage and computation cost while increasing accuracy. In addition, we improve binary neural networks by using a fixed rank for tensor train decomposition, which increases model accuracy by 2%–4%.

We employ our methods in two case studies. In the first, we apply the ultimate compression method to a crowd counting application and achieve a compression ratio of 23× compared with floating-point models. In the second, we adapt the MobileNet-v2 neural network model to an emotion classification application based on EEG signals by binarizing the model and replacing the 2D convolutional layer with a 3D one. We improve the binary model's accuracy using several techniques, achieving an accuracy only 2%–5% lower than that of the floating-point counterpart while reducing the storage size by 40%.
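To make the stochastic-rounding idea above concrete, the sketch below illustrates it on a logarithmic (power-of-two) weight grid. It is not code from the dissertation: the function name, the base-2 grid, and the choice of eight averaged samples are illustrative assumptions. A single deterministic (nearest) rounding of each weight's exponent is compared with the average of several stochastically rounded copies.

```python
import numpy as np

def log2_quantize(w, rng, stochastic=True):
    """Quantize weights to signed powers of two (illustrative sketch).

    The exponent log2(|w|) is rounded to an integer either
    deterministically (nearest) or stochastically, where the
    probability of rounding up equals the fractional part of the
    exponent.
    """
    sign = np.sign(w)
    mag = np.maximum(np.abs(w), np.finfo(w.dtype).tiny)  # avoid log2(0)
    exp = np.log2(mag)
    if stochastic:
        floor = np.floor(exp)
        frac = exp - floor
        exp_q = floor + (rng.random(exp.shape) < frac)
    else:
        exp_q = np.round(exp)
    return sign * np.exp2(exp_q)

rng = np.random.default_rng(0)
w = rng.normal(scale=0.05, size=10_000).astype(np.float32)

deterministic = log2_quantize(w, rng, stochastic=False)
averaged = np.mean([log2_quantize(w, rng) for _ in range(8)], axis=0)

# Averaging several stochastically rounded copies typically lands
# closer to the original weights than a single deterministic rounding.
print("deterministic rounding MSE:", np.mean((deterministic - w) ** 2))
print("8-sample stochastic average MSE:", np.mean((averaged - w) ** 2))
```

Under these assumptions, the averaged stochastic version generally shows a lower mean-squared error against the original weights than deterministic nearest rounding, which is the intuition behind recovering floating-point-level accuracy from a coarse logarithmic representation.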