An Energy-Efficient SqueezeNet Implementation on the KiloCore Platform
Many convolutional neural networks (CNNs) have been developed for object detection, image classification, and facial recognition applications. Although most deep CNNs have focused on improving accuracy, few have focused on reducing the required hardware resources. While reducing hardware requirements is expected to lower throughput, these simpler architectures offer advantages such as lower latency, lower power dissipation, and smaller memory footprints. In addition, simpler CNNs can be deployed on a wider range of devices and are generally easier to train because they contain fewer parameters. This thesis proposes a KiloCore implementation of SqueezeNet, a lightweight CNN that offers low energy consumption and high throughput, and contains 1,248,424 parameters in 22 layers: 18 convolutional layers and 4 pooling layers.
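The 1,248,424-parameter figure can be checked directly from the SqueezeNet v1.0 layer dimensions published by Iandola et al.; the sketch below assumes the thesis uses that reference architecture (the pooling layers contribute no parameters, which is why only the 18 convolutional layers appear in the sum):

```python
# Parameter count of SqueezeNet v1.0 (layer sizes taken from the original
# SqueezeNet paper; assumed here to match the network used in this thesis).

def conv_params(out_ch, in_ch, k):
    """Weights plus one bias per output channel for a k x k convolution."""
    return out_ch * (in_ch * k * k + 1)

def fire_params(in_ch, squeeze, expand):
    """Fire module: 1x1 squeeze layer feeding parallel 1x1 and 3x3 expands."""
    return (conv_params(squeeze, in_ch, 1)
            + conv_params(expand, squeeze, 1)
            + conv_params(expand, squeeze, 3))

total = conv_params(96, 3, 7)            # conv1
total += fire_params(96, 16, 64)         # fire2
total += fire_params(128, 16, 64)        # fire3
total += fire_params(128, 32, 128)       # fire4
total += fire_params(256, 32, 128)       # fire5
total += fire_params(256, 48, 192)       # fire6
total += fire_params(384, 48, 192)       # fire7
total += fire_params(384, 64, 256)       # fire8
total += fire_params(512, 64, 256)       # fire9
total += conv_params(1000, 512, 1)       # conv10
print(total)  # 1248424, matching the 1,248,424 figure above
```

The total reproduces the parameter count quoted in the abstract exactly, which also illustrates why SqueezeNet is "lightweight": most parameters sit in the small squeeze/expand convolutions rather than in large fully connected layers.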
This thesis presents an implementation of SqueezeNet running on KiloCore, a fine-grain many-core processor array. The metrics compared include energy per frame, power, throughput, throughput per area, energy-delay product (EDP), and memory. We compare against SqueezeNet implementations running on an Intel Xeon E3-1275 v5 @ 3.6 GHz, an Intel i5-5250U @ 2.7 GHz, an Intel Knights Landing @ 1.7 GHz, a Qualcomm Snapdragon 810 @ 1.5 GHz, an NVIDIA Pascal GPU @ 3.0 GHz, and an ARMv7 processor @ 0.9 GHz.
The KiloCore many-core implementation achieves 1.0×–17.0× lower energy per frame and 3.1×–35.3× lower power dissipation than the other platforms. Its throughput is 4.8× higher than that of the ARMv7 processor. Its EDP falls in the middle of the compared hardware platforms and is 95.2× lower than that of the ARMv7 processor. Finally, the KiloCore implementation of SqueezeNet requires significantly less memory than the other programmable processors.
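Since EDP combines two of the other metrics, a brief sketch of how it is formed may help: EDP per frame is the energy per frame multiplied by the time per frame (the reciprocal of throughput), so gains in energy and in delay compound multiplicatively. The function and figures below are purely illustrative, not measurements from this thesis:

```python
# Energy-delay product (EDP) from per-frame energy and throughput.
# All numbers here are hypothetical, chosen only to show the arithmetic.

def edp(energy_per_frame_mj, throughput_fps):
    """EDP per frame = energy (mJ) x delay (s), in mJ*s."""
    return energy_per_frame_mj * (1.0 / throughput_fps)

# A design with 10x lower energy and 2x higher throughput than a baseline
# improves EDP by 10 * 2 = 20x.
edp_fast = edp(10.0, 50.0)    # 0.2 mJ*s
edp_base = edp(100.0, 25.0)   # 4.0 mJ*s
print(edp_base / edp_fast)    # 20.0
```

This multiplicative behavior is why a platform can sit mid-range on EDP overall yet still show a large (here, 95.2×) EDP advantage over a single slower, higher-energy competitor.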