Breakthroughs in multi-layer convolutional neural networks (CNNs) have driven significant progress in image classification and recognition applications. The size of CNNs has grown continuously to improve their prediction capabilities across various applications, and performing the required computations has become increasingly costly. In particular, CNNs involve a large number of multiply-accumulate (MAC) operations, and minimizing the cost of multiplication is important because it consumes the most computational resources.
This dissertation proposes cost-efficient approximate log multipliers optimized for performing CNN inferences. Approximate multipliers have lower hardware costs than conventional multipliers but produce products that are not exact. The proposed multipliers are based on Mitchell's Log Multiplication, which converts multiplications to additions by taking approximate logarithms. Various design techniques are applied to the Mitchell Log Multiplier, including a fully parallel leading-one detector (LOD), efficient shift amount calculation, and exact zero computation. Additionally, truncation of the operands is studied to create a customizable log multiplier that further reduces energy consumption. This dissertation also proposes using one's complements to handle negative numbers, which significantly reduces the associated costs while having minimal impact on CNN performance. The viability of the proposed designs is supported by detailed formal analysis as well as experimental results on CNNs. The proposed customizable design at w = 8 saves up to 88% of energy compared to the exact 32-bit fixed-point multiplier, with only a 0.2% performance degradation on AlexNet for the ImageNet ILSVRC2012 dataset.
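To make the underlying principle concrete, the following sketch models an unsigned Mitchell log multiplication together with one's-complement sign handling in Python. The function names (mitchell_mult, signed_mitchell_mult), the default bit width, and the use of floating-point fractions are illustrative assumptions for readability; the hardware designs in this dissertation operate on fixed-point bit fields rather than floats.

```python
def mitchell_mult(a: int, b: int) -> int:
    """Approximate the unsigned product a*b via Mitchell's log multiplication."""
    if a == 0 or b == 0:                 # exact zero computation
        return 0
    k1 = a.bit_length() - 1              # leading-one position of a (LOD)
    k2 = b.bit_length() - 1              # leading-one position of b (LOD)
    # Fractional parts x = (a - 2^k) / 2^k; log2(a) is approximated as k + x.
    x1 = (a - (1 << k1)) / (1 << k1)
    x2 = (b - (1 << k2)) / (1 << k2)
    k, x = k1 + k2, x1 + x2              # add the approximate logarithms
    if x >= 1.0:                         # carry out of the fraction addition
        k, x = k + 1, x - 1.0
    # Approximate antilogarithm: 2^(k + x) ~ 2^k * (1 + x)
    return int((1 << k) * (1 + x))


def signed_mitchell_mult(a: int, b: int, width: int = 8) -> int:
    """Sign handling with one's complement: negation by bit inversion only,
    avoiding the +1 increment of two's complement at the cost of a small
    magnitude error (assumes |a|, |b| < 2**width; width is illustrative)."""
    sign = (a < 0) != (b < 0)
    mask = (1 << width) - 1
    ua = (~a) & mask if a < 0 else a     # one's-complement "absolute value"
    ub = (~b) & mask if b < 0 else b
    p = mitchell_mult(ua, ub)
    return -p if sign else p
```

For instance, mitchell_mult(3, 3) returns 8 rather than 9, reflecting the well-known worst-case relative error of Mitchell's approximation of about 11.1%.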
The effects of approximate multiplication are analyzed when performing inferences on deep CNNs, to provide a deeper understanding of why CNN inferences are resilient to errors in multiplication. The analysis identifies the critical factors in the convolution, fully-connected, and batch normalization layers that allow CNN predictions to remain accurate despite the errors from approximate multiplication. The same factors also provide an arithmetic explanation of why bfloat16 multiplication performs well on CNNs. Experiments with deep network architectures, such as ResNet and Inception-v4, show that the approximate multipliers can produce predictions that are nearly as accurate as the FP32 references, while saving a significant amount of energy compared to bfloat16 arithmetic.
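As a rough illustration of this kind of behavior, the sketch below (reusing signed_mitchell_mult from the previous sketch) compares an exact dot product, the basic operation of convolution and fully-connected layers, with its approximate counterpart on random signed operands. The vector length, value range, and random seed are arbitrary choices for illustration, not the experimental setup used in the dissertation.

```python
import random

def dot_exact(xs, ws):
    return sum(x * w for x, w in zip(xs, ws))

def dot_mitchell(xs, ws):
    # The accumulation itself is exact; only the individual products carry the
    # approximation error. Because Mitchell's approximation underestimates the
    # magnitude of each product, errors from positive and negative products
    # have opposite signs and partially cancel in the sum.
    return sum(signed_mitchell_mult(x, w) for x, w in zip(xs, ws))

random.seed(0)
xs = [random.randint(-127, 127) for _ in range(256)]
ws = [random.randint(-127, 127) for _ in range(256)]

exact = dot_exact(xs, ws)
approx = dot_mitchell(xs, ws)
print(exact, approx, abs(exact - approx) / max(1, abs(exact)))
```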
Lastly, a convolution core that utilizes the approximate log multiplier is designed to significantly reduce the power consumption of FPGA accelerators. The core also exploits FPGA reconfigurability as well as the parallelism and input-sharing opportunities in convolution to minimize hardware costs. Simulation results show reductions of up to 78.19% in LUT usage and 60.54% in power consumption compared to a core that uses exact fixed-point multipliers, while maintaining comparable accuracy with LeNet on the MNIST dataset.
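To give a flavor of what input sharing can mean for a log-based multiplier array, the following sketch computes the products of one shared activation with several kernel weights while encoding the shared operand's logarithm only once. This is an illustrative assumption about one possible source of savings, not a description of the actual core; the unsigned-only handling, the function name shared_operand_products, and the Python modeling are simplifications.

```python
def shared_operand_products(x: int, weights: list[int]) -> list[int]:
    """Multiply one shared (unsigned) activation by several kernel weights,
    reusing the leading-one detection and log encoding of x across the array."""
    if x == 0:                               # exact zero for the shared input
        return [0] * len(weights)
    kx = x.bit_length() - 1                  # LOD and log encoding of x, done once
    fx = (x - (1 << kx)) / (1 << kx)
    products = []
    for w in weights:
        if w == 0:
            products.append(0)
            continue
        kw = w.bit_length() - 1              # per-weight log encoding
        fw = (w - (1 << kw)) / (1 << kw)
        k, f = kx + kw, fx + fw              # add the approximate logarithms
        if f >= 1.0:
            k, f = k + 1, f - 1.0
        products.append(int((1 << k) * (1 + f)))
    return products
```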