Partition Pruning: Parallelization-Aware Pruning for Dense Neural Networks

As recent neural networks have grown more accurate, their model sizes have grown exponentially. A huge number of parameters must therefore be loaded from and stored to the memory hierarchy, and processed by compute units, during the training or inference phase of neural network processing. This growth poses a major challenge for real-time deployment, since the trend in memory bandwidth improvement cannot keep up with the trend in model complexity. Although some operations in neural network processing, such as convolutional layers, are compute-intensive, computing dense layers is bottlenecked by memory bandwidth. To address this issue, this paper proposes Partition Pruning for dense layers, which reduces the number of required parameters while taking parallelization into account. We evaluated the performance and energy consumption of parallel inference of partitioned models, which showed a 7.72x speedup and a 2.73x reduction in energy when computing the pruned fully connected layers of the TinyVGG16 model, compared to running the unpruned model on a single accelerator. In addition, our method showed only a limited reduction in accuracy when partitioning fully connected layers.


I. INTRODUCTION
Neural networks have become ubiquitous in applications that include computer vision, speech recognition, and natural language processing. The demand for processing neural network applications on edge devices, including smartphones, drones, and autonomous vehicles, is increasing [1]. Meanwhile, the size of neural network models has grown drastically over time, reaching beyond the peta scale [1]. In 1998, a handwritten digit classifier had about 1 M parameters [2], but in 2012, an image classifier for the ImageNet [3] dataset had more than 60 M parameters. In addition, NeuralTalk, which automatically generates captions for the ImageNet dataset, has more than 230 M parameters [4]. The top-5 error has been reduced by roughly 30% each year, but this trend drastically increases the number of layers, parameters, and operations [1].

As recent neural networks are getting significantly larger, memory is now one of the biggest challenges in deep learning hardware. Memory is used to store input data, parameters, activations, and temporary outputs in the inference or training phase of neural network processing. Memory capacity is a limitation for storing the huge number of weights and parameters in DNNs. Furthermore, another important bottleneck is memory bandwidth, especially to off-chip memory [27]. To address these issues, researchers must apply innovations at the algorithm, architecture, and circuit levels. Both the memory bandwidth bottleneck and computational complexity motivate sparsity and/or reducing the number of parameters in a neural network. For example, AlexNet requires 234 MB of memory space for storing parameters and 635 million arithmetic operations for feed-forward processing. AlexNet's convolutional layers are locally connected, but they are followed by fully connected layers that make up 95% of the connections in the network [6]. Fully connected layers are over-parameterized and tend to overfit the training data.

At the algorithm level, pruning methods were proposed before deep learning became popular. Based on the assumption that many parameters are unnecessary, pruning methods remove these parameters, increasing the sparsity of layers [7]. Previous research has sought to reduce the number of parameters. Dropping random connections was proposed by [11]. Optimal Brain Damage [12] and Optimal Brain Surgeon [13] reduced the number of connections according to the loss function. Singular value decomposition (SVD) decreased the number of weights [14]. Another approach, adopted by the GoogLeNet model [15], exploits convolutional layers rather than fully connected layers. Pruning results in sparse layers that provide three benefits [16]. First, sparse layers require less storage space for parameters. Second, computation of the removed edges is omitted, which reduces power consumption and latency. Third, less memory bandwidth is required to transfer parameters from DRAM.

In this paper, based on the insight that smart pruning can reduce the number of off-chip accesses, we propose a new pruning scheme that increases parallelization and reduces the required memory bandwidth. We partition a large weight matrix into small matrices and distribute them to multiple computational units. The proposed pruning algorithm has three objectives: first, enhancing the parallelism among accelerators; second, reducing the number of off-chip accesses; and third, maintaining accuracy as close as possible to the baseline.
The experimental results show that the proposed scheme achieves a 7.72x speedup and a 2.73x improvement in energy efficiency. The rest of this paper is organized as follows. Section II describes the Partition Pruning algorithm, followed by the experimental setup and evaluation methodology in Section III. We discuss the results in Section IV, and we conclude the paper in Section V.

II. PARALLELIZATION-AWARE PRUNING
Partition Pruning prunes dense layers in neural networks while increasing the degree of parallelism by partitioning the layers. At a high level, a trained model that has one or more dense layers is used as the input to the Partition Pruning method, as shown in Figure 1. In the framework, a neural network model is first trained. Then, the fully connected layers of the model are pruned using the Partition Pruning approach. Finally, inference of the pruned neural network is performed on multiple processing cores.

A. Partition Pruning Method
Our framework targets neural networks in which some or all of the nodes are fully connected to the nodes of the subsequent layer. The set of starting nodes, N_initial, is fully connected to the set of subsequent nodes, N_final, i.e., they form a fully connected layer. A link, which carries a parameter, is a connection represented by L_{i,j}, where i is the starting node number and j is the connected node number within a layer. The link's value (i.e., the parameter's weight) is represented by w_{i,j}. L_{i,j} = 0 if the link is pruned; otherwise, L_{i,j} = 1. Note that w_{i,j} may contain any value. The set of weights, W_i, consists of the links, L_i, that connect the set of nodes N_i to N_j. Figure 2a shows an example of a fully connected layer of size 6 x 8, Figure 2b shows the matrix representation of the fully connected layer, and Figure 2c shows its weight matrix.

The connectedness number, C, is simply

  C = Σ_{i,j} L_{i,j}.

A fully connected layer is annotated as C_full, and thus

  C_full = |N_initial| × |N_final|.

Therefore, the connectedness ratio, R, is

  R = C / C_full.

Figure 3 illustrates the partitioning and the reduction in the number of weights due to that partitioning. Given that there are |P| partitions, where P_x ∈ P, any given node N_initial,j ∈ P_x will not be in any other partition, and the same holds for the nodes in N_final. More formally, the constraint on the grouping of the nodes in N_initial and N_final is

  P_x ∩ P_y = ∅ for all x ≠ y.

That is, once a particular node is in a particular partition, it cannot be a member of another partition. Another way of stating this is that the number of N_initial,i nodes that are members of a partition P_n is bounded:

  ⌊|N_initial| / |P|⌋ ≤ |P_n ∩ N_initial| ≤ ⌈|N_initial| / |P|⌉.

The same is true for the N_final,i nodes. In addition, the number of partitions that contain the upper limit is |N_initial| mod |P|, while the number that contain the lower limit is |P| − (|N_initial| mod |P|).

The objective is to minimize the cumulative weight-loss, i.e., the total magnitude of the weights removed by pruning, Σ_{L_{i,j}=0} |w_{i,j}|. From this objective function, we determine which (1 − R/|P|) C_full parameters are pruned for a particular fully connected layer. The input to the Partition Pruning algorithm is the matrix representation, W_fc,i, of the targeted fully connected layer i, as exemplified in Figure 2c. Note that the fully connected layer is assumed, and asserted, to be trained; that is, the parameters have the correct values for the targeted neural network's baseline accuracy. In a fully connected layer, every element of the matrix L_fc,i is 1 (see Equation 2). After Partition Pruning, the output is L_part,i, and the sum of all its elements is (R/|P|) C_full, which for a fully connected input (R = 1) equals C_full / |P|. This is exemplified in Figure 3b.
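To make the partitioning step concrete, the following NumPy sketch builds L_part for one trained weight matrix. It is a minimal illustration: the balanced grouping mirrors the floor/ceil bound above, while the greedy assignment of output nodes is only an assumed heuristic for limiting the cumulative weight-loss, not necessarily the authors' exact algorithm.

```python
import numpy as np

def partition_prune(w, num_partitions):
    """Greedy sketch of Partition Pruning for one fully connected layer.

    w: trained weight matrix of shape (|N_initial|, |N_final|).
    Returns the pruned weights and the 0/1 link matrix L_part. The balanced
    grouping follows the floor/ceil bound in the text; the greedy output-node
    assignment is one plausible way to keep the cumulative magnitude of the
    pruned weights low, and the paper's exact procedure may differ.
    """
    n_in, n_out = w.shape
    # Balanced, contiguous grouping of the input nodes (sizes differ by at most 1).
    in_part = np.array_split(np.arange(n_in), num_partitions)
    # Remaining capacity of each partition on the output side, also balanced.
    capacity = [len(c) for c in np.array_split(np.arange(n_out), num_partitions)]

    # Assign each output node to the partition whose inputs carry the largest
    # total |weight| toward it, subject to the capacity bound; strongest nodes first.
    out_assign = np.full(n_out, -1)
    for j in np.argsort(-np.abs(w).sum(axis=0)):
        scores = [np.abs(w[rows, j]).sum() for rows in in_part]
        for p in np.argsort(scores)[::-1]:
            if capacity[p] > 0:
                out_assign[j] = p
                capacity[p] -= 1
                break

    # Keep only the links whose two endpoints fall in the same partition.
    l_part = np.zeros_like(w, dtype=int)
    for p, rows in enumerate(in_part):
        cols = np.where(out_assign == p)[0]
        l_part[np.ix_(rows, cols)] = 1
    return w * l_part, l_part

# Example: the 6 x 8 layer of Figure 2 split into 2 partitions keeps
# C_full / |P| = 48 / 2 = 24 of the 48 links.
w = np.random.default_rng(0).standard_normal((6, 8))
w_pruned, l_part = partition_prune(w, 2)
print(int(l_part.sum()))   # 24
```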
B. Multi-Core Organization
Figure 5 shows the architecture of a System on Chip (SoC) that consists of general-purpose cores, memory controllers, a DMA engine, and matrix multiplication accelerators, all connected through the system bus. To understand how the system level affects the accelerators' behavior, a simulation infrastructure that can model such heterogeneous systems is needed. The gem5-Aladdin system simulator [25] is used to evaluate the proposed architecture. This tool integrates the gem5 system simulator with the Aladdin accelerator simulator. It is a pre-RTL simulation infrastructure that models multiple accelerators and their interactions with central processing units (CPUs) in an SoC consisting of processing elements (PEs), fixed-function accelerators, memory controllers, and interfaces. The simulator can model the accelerators' performance, area, and power [25], [26]. Multiple matrix multiplication units are connected to the bus.
In the gem5-Aladdin system, the accelerators can invoke the DMA engine already present in gem5, which is used to transfer bulk data without the CPU's intervention. The internal SRAM stores the weights, input features, and outputs of the matrix multiplication. Each accelerator uses a 32 x 32 systolic array (SA). The SA architecture is a specialized form of parallel computing in which tightly coupled processing elements are connected to a small number of their nearest neighbors in a mesh-like topology. This architecture requires very little global data movement and can achieve a high clock frequency. However, the SA architecture suffers from scalability issues because its shape is fixed.
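At the accelerator level, the benefit of partitioning can be seen by counting the 32 x 32 tile operations each systolic array must perform. The sketch below uses an illustrative 4096 x 4096 fully connected layer (an assumption; the benchmarks' actual layer sizes are not restated here) and ignores DMA transfers and bus contention.

```python
import math

def tiles_for_fc(n_in, n_out, batch=1, sa_dim=32):
    """Number of sa_dim x sa_dim tile multiplications for output = input (batch x n_in) x weights (n_in x n_out)."""
    return (math.ceil(batch / sa_dim)
            * math.ceil(n_in / sa_dim)
            * math.ceil(n_out / sa_dim))

full_layer = tiles_for_fc(4096, 4096)       # unpruned layer on a single accelerator
one_partition = tiles_for_fc(1366, 1366)    # one of three partitions (ceil(4096 / 3) = 1366)
print(full_layer, one_partition)            # 16384 vs. 1849: each accelerator does ~1/9 of the tiles
```

Each accelerator's workload shrinks roughly with the square of the partition count, while the total pruned workload shrinks linearly; the difference is what running the partitions on multiple accelerators exploits in parallel.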

In an SA, the horizontal systolic movements implement data broadcasts, and the vertical movements implement accumulations [9].
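The dataflow just described can be checked with a small functional model (not cycle-accurate, and independent of the gem5-Aladdin accelerator code): each input value is broadcast along a PE row while partial sums accumulate down each PE column, which together reproduce a matrix multiplication.

```python
import numpy as np

def sa_functional_matmul(a, w):
    """Functional model of the SA dataflow described above (not cycle-accurate).

    Each PE at position (i, j) holds weight w[i, j]. An input value a[m, i] is
    broadcast horizontally along PE row i, and partial sums flow vertically down
    each PE column j, accumulating into the output element out[m, j].
    """
    m_rows, k = a.shape
    k2, n = w.shape
    assert k == k2
    out = np.zeros((m_rows, n))
    for m in range(m_rows):                # one input row streams through the array
        psum = np.zeros(n)                 # per-column partial sums moving downward
        for i in range(k):                 # horizontal broadcast of a[m, i] to PE row i
            psum += a[m, i] * w[i, :]      # each PE multiplies and adds to the passing sum
        out[m] = psum
    return out

a = np.random.rand(4, 32)
w = np.random.rand(32, 32)
assert np.allclose(sa_functional_matmul(a, w), a @ w)
```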

III. EXPERIMENTAL SETUP
Fully connected layers are pruned using Partition Pruning for three networks trained on the TinyImageNet [21] dataset, which consists of 100,000 training images, 10,000 validation images, and 10,000 testing images of dimensions 64x64x3 spanning 200 classes. These images are taken from the ImageNet [3] dataset, cropped into squares, and resized to 64x64. For each network, the fully connected layers are partitioned into 2, 3, 4, and 5 partitions, resulting in the pruning of 50%, 66%, 75%, and 80% of the fully connected links, respectively.
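These percentages follow directly from the connectedness ratio of Section II: a |P|-way partition keeps only C_full / |P| of the links, so the pruned fraction is 1 − 1/|P|. A one-line check (the 66% above is 66.7% truncated):

```python
# Pruned fraction implied by a |P|-way partition of a fully connected layer.
for p in (2, 3, 4, 5):
    print(f"|P| = {p}: {1 - 1 / p:.1%} of the links pruned")
# |P| = 2: 50.0%, |P| = 3: 66.7%, |P| = 4: 75.0%, |P| = 5: 80.0%
```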
Initially, the neural networks are trained and evaluated on the TinyImageNet dataset, as shown in Table II. Convolutional neural networks represent the state of the art in image classification. AlexNet [22] and VGG16 [2] are well-known deep convolutional neural networks that have previously won ImageNet competitions. TinyVGG16 and TinyAlexNet use a 56x56x3 input image instead of the 228x228x3 input used by the original VGG16 and AlexNet. Each network has three fully connected layers at the end of its structure. Partition Pruning prunes the first two of these three fully connected layers. The last fully connected layer is not pruned because every link is required for classification; if it were pruned, the classification accuracy would be affected considerably, to the detriment of the neural network model. Table II shows the benchmarks' baseline performance. After training the networks, Partition Pruning is applied to two of the three fully connected layers. Google's TensorFlow [23] version 1.7 was used to model the benchmarks. Partition Pruning was implemented in Python 2.7 and was given the NumPy matrices of the first two fully connected layers of each benchmark. The weights in the TensorFlow model files were then updated using the resulting output filters. Note that, as mentioned earlier, gem5-Aladdin is used to evaluate the performance.
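The weight-update step can be sketched with standard TensorFlow 1.x checkpoint calls; the checkpoint path, the variable names, and the reuse of the partition_prune sketch from Section II are illustrative assumptions rather than the authors' actual scripts.

```python
import tensorflow as tf  # TF 1.x API, matching the paper's TensorFlow 1.7 setup

CKPT = 'tinyvgg16.ckpt'                      # hypothetical checkpoint name
FC_VARS = ('fc1/weights', 'fc2/weights')     # hypothetical names of the first two FC layers

with tf.Session() as sess:
    saver = tf.train.import_meta_graph(CKPT + '.meta')
    saver.restore(sess, CKPT)
    for var in tf.global_variables():
        if var.op.name in FC_VARS:
            w = sess.run(var)                                   # pull the weights as NumPy
            w_pruned, _ = partition_prune(w, num_partitions=3)  # sketch from Section II-A
            var.load(w_pruned, sess)                            # write the pruned weights back
    saver.save(sess, 'tinyvgg16_pruned.ckpt')
```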

IV. RESULTS
Table II shows the initial baseline accuracies, without pruning, of the TensorFlow implementations of the neural network benchmarks. Figure 6 shows the resulting accuracy losses of the Partition Pruning algorithm for TinyVGG16 and TinyAlexNet; results after retraining are also shown. Accuracy loss increases as the number of partitions increases, since more parameters are pruned. Retraining the models after pruning reduces the loss of accuracy. For example, with 3 partitions, retraining reduces the accuracy loss of TinyVGG16 from 10.59% to 0.87%.
As Figure 5 shows, running inference of the partitioned TinyVGG16 layers on different accelerators improves performance and reduces energy consumption compared to running inference of the unpruned layers on a single accelerator. For example, running this benchmark on three accelerators executes 7.72x faster while consuming 2.73x less energy. This is because pruning reduces the size of the benchmark by a factor correlated with the number of partitions (for example, by a factor of 2x for two partitions), and running inference in parallel on multiple accelerators further reduces the execution time. Therefore, both performance and energy consumption are improved by reducing the size of the models and using multiple hardware resources.
Running the same benchmarks on multiple accelerators does not scale as well as expected. For example, running two identical workloads on two accelerators increases speed by 1.8x, and on three accelerators by 2.5x. This happens because all accelerators are connected to the same bus with one DMA engine, which leads to bus congestion. It is expected that using multiple large SAs, for example 256 x 256, would cause bandwidth bottlenecks and sizeable bus congestion. Although using a small SA does not provide high-throughput processing, it leads to a low-power design because of the small number of processing elements in each accelerator.
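As a rough consistency check (an approximation, not the gem5-Aladdin performance model), the reported 3-partition speedup can be seen as the product of the model-size reduction and the measured three-accelerator scaling:

```python
# Back-of-the-envelope check of the 3-partition TinyVGG16 speedup.
partitions = 3
size_reduction = partitions      # pruning shrinks the FC layers by roughly |P|x
parallel_scaling = 2.5           # measured scaling for three accelerators on the shared bus
print(size_reduction * parallel_scaling)   # ~7.5x, close to the reported 7.72x
```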

V. CONCLUSIONS
This paper presented Partition Pruning, an approach that prunes the fully connected layers of neural network models with the aim of partitioning them for parallelization in order to improve speed and energy consumption. The approach shows that by partitioning the dense layers of TinyVGG16 into 3 partitions and executing the model on multiple accelerators, a 7.72x speedup and a 2.73x energy reduction can be obtained. In addition, the pruning approach leads to less than 2% accuracy loss. Future work will evaluate a system that has multiple high-bandwidth memories and neural network accelerators, and will apply further optimizations to the accelerators to minimize power consumption and increase throughput.