Approximate and Bit-width Configurable Arithmetic Logic Unit Design for Deep Learning Accelerator
- Author(s): Chen, Xiaoliang
- Advisor(s): Kurdahi, Fadi J
- et al.
As key building blocks for digital signal processing, image processing, and deep learning, adders, multi-operand adders, and multiply-accumulate (MAC) units have drawn considerable attention recently. Two popular ways to improve arithmetic logic unit (ALU) performance and energy efficiency are approximate computing and precision-scalable design. Approximate computing achieves better performance or energy efficiency by trading off accuracy. Precision-scalable design allocates just enough hardware resources to meet the application's requirements.
In this thesis, we first present a correlation-aware predictor (CAP) based approximate adder, which uses the spatial-temporal correlation of input streams to predict the carry-in signals of sub-block adders. CAP uses fewer prediction bits to reduce the overall adder delay. For highly correlated input streams, we found that CAP can reduce adder delay by about 23.33% and save about 15.9% area at the same error rate compared to prior works.
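The idea of predicting sub-block carry-ins can be illustrated with a minimal sketch. The abstract does not specify the CAP predictor itself, so the sketch below substitutes a deliberately simple stand-in: each block's carry-in is predicted to be the true carry-in observed for the previous input pair in the stream. The function names, the 16-bit width, and the 4-bit block size are all assumptions chosen for illustration, not details from the thesis.

```python
def block_add_with_predicted_carries(a, b, predicted, width=16, block=4):
    """Add a and b block by block, feeding each block a predicted carry-in.

    Returns (approximate_sum, actual_carry_ins). The actual carry-ins are
    computed with an exact ripple chain so they can serve as the next
    prediction; real hardware would obtain them off the critical path.
    """
    n_blocks = width // block
    mask = (1 << block) - 1
    result = 0
    actual = [0] * n_blocks
    carry = 0  # true carry-in to the current block
    for i in range(n_blocks):
        actual[i] = carry
        a_blk = (a >> (i * block)) & mask
        b_blk = (b >> (i * block)) & mask
        # The lowest block has no carry-in; the others use the prediction.
        cin = 0 if i == 0 else predicted[i]
        result |= ((a_blk + b_blk + cin) & mask) << (i * block)
        # Exact carry-out of this block becomes the next block's true carry-in.
        carry = (a_blk + b_blk + carry) >> block
    return result, actual


def add_stream(pairs, width=16, block=4):
    """Add a stream of (a, b) pairs, predicting carries from the previous pair.

    This last-value predictor is a hypothetical stand-in for CAP.
    """
    predicted = [0] * (width // block)
    sums = []
    for a, b in pairs:
        approx, predicted = block_add_with_predicted_carries(
            a, b, predicted, width, block
        )
        sums.append(approx)
    return sums
```

When consecutive operand pairs are highly correlated, their carry patterns tend to repeat, so a predictor of this kind is right most of the time; when the prediction matches the true carry-ins, the result equals the exact sum modulo 2^width.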
Inspired by the success of approximate multipliers built from approximate compressors, we propose a pipelined, approximate-compressor-based speculative multi-operand adder (AC-MOA). All compressors are replaced with approximate ones to reduce the overall delay of the bit-array reduction tree. An efficient error detection and correction block compensates for errors at the cost of one extra cycle. Experimental results show that the proposed 8-bit, 8-operand AC-MOA achieves a 1.47x to 1.66x speedup over a conventional baseline design.
Recent research on deep learning algorithms has shown that bit-width can be reduced without losing accuracy. Because bit-width requirements vary across deep learning applications, bit-width configurable designs can be used to improve hardware efficiency. In this thesis, a bit-width configurable MAC (BC-MAC) is proposed. BC-MAC uses a spatial-temporal approach to support variable precision requirements for both activations and weights. The basic processing element (PE) of BC-MAC is a multi-operand adder. Multiple multi-operand adders can be combined to support input operands of any precision. Bit-serial summation accumulates the partial addition results to perform MAC operations. Booth encoding is employed to further boost throughput. Synthesis results on TSMC 16nm technology and simulation results show that the proposed MAC achieves higher area efficiency and energy efficiency than state-of-the-art designs, making it a promising ALU for deep learning accelerators.
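A minimal behavioral sketch of a bit-serial MAC built around a multi-operand adder may help make the idea concrete. It assumes unsigned weights processed one bit per step, omits Booth encoding, and does not model how BC-MAC combines several adders for wider operands; the function name and the 8-bit default are illustrative assumptions, not taken from the thesis.

```python
def bit_serial_mac(activations, weights, weight_bits=8):
    """Compute sum(a * w) by processing one weight bit position per step.

    Assumes unsigned weights that fit in weight_bits (an assumption made
    for this sketch; signed handling and Booth encoding are omitted).
    """
    acc = 0
    for bit in range(weight_bits):
        # Multi-operand addition: sum the activations whose weight has
        # this bit set.
        partial = sum(a for a, w in zip(activations, weights) if (w >> bit) & 1)
        # Shift by the bit's significance and accumulate.
        acc += partial << bit
    return acc
```

In this form, reducing the weight precision simply shortens the loop, which is one way to see why a bit-serial organization suits variable bit-width workloads.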