Deep learning has revolutionized the way humans interact with technology, enabling complex tasks that were once thought to be impossible. Its ability to process vast amounts of data and learn from patterns has led to significant advancements in areas such as image and speech recognition, natural language processing, and robotics. These advancements have had a profound impact on people's daily lives, from virtual assistants and self-driving cars to personalized recommendations on social media platforms and fraud detection in banking and finance. Yet, while deep learning has enabled remarkable progress across various industries, its wide adoption has led to alarming repercussions in terms of carbon emissions and energy consumption. This is due to the high computational and storage requirements of deep learning models, which are served from large data centers and computing infrastructures that consume vast amounts of energy and have become a significant contributor to carbon emissions. As deep learning models grow increasingly complex, the energy consumed for training and inference is expected to rise, exacerbating the problem. The challenge is compounded by the fact that the computing infrastructures currently used for training and serving inference for these models are significantly underutilized.

This PhD dissertation sets out to take on this pressing challenge and rethink the design of custom neural accelerators and their adoption in both cloud and edge infrastructures by devising solutions across the whole compute stack, ranging from circuits to systems and algorithms. To that end, the contributions of this dissertation are as follows:
• Devising BIHIWE, a programmable mixed-signal DNN accelerator that leverages the innate energy efficiency of analog computing. To address the challenges associated with analog computing, I leverage the mathematical properties of deep learning operations and define a new computing model for dot-product operations, along with its mixed-signal computing circuitry (a worked sketch of such a dot-product decomposition appears after this list). I further design a programmable hierarchical clustered architecture that integrates the mixed-signal compute units and propose solutions to further mitigate the non-idealities of analog computing.
• Designing an ultra-energy-efficient acceleration solution for deeply quantized neural networks. The proposed design intersperses bit-level parallelism within data-level parallelism and dynamically interweaves the two. This design paradigm enables the dynamic composition of narrow-bitwidth vector engines at the bit granularity, based on the required bitwidth of each DNN layer (illustrated in the composition sketch after this list). This composition mode amortizes the cost of aggregation and operand delivery across a vector of elements, bringing forth significant energy savings and performance improvements.
• Proposing a novel microarchitecture and Instruction Set Architecture for a companion processor in neural accelerators, tackling the challenges of executing emerging and novel operations in DNNs. The design strikes a balance between customization and programmability to keep up with the volatility of deep learning research, while offering significant performance and energy gains compared to prior work.
• Proposing Planaria, the first neural accelerator design that offers simultaneous multi-tenant acceleration of DNNs. The design introduces and leverages the novel concept of runtime architecture fission, which breaks a monolithic accelerator into smaller yet full-fledged accelerators to enable spatial co-location of multiple DNN inference requests. To best utilize this microarchitectural capability, I also propose a task scheduling algorithm that breaks up the accelerator with respect to the current server load, DNN topology, and task priorities, all while considering the latency bounds of the tasks (a simplified scheduling sketch appears after this list). This work opens a new dimension in the design of neural accelerators, one that considers utilization, cost-effectiveness, and responsiveness in datacenters.
• Devising a mathematical formulation for pruning the inconsequential operations in the self-attention layers of transformer models. This formulation piggybacks on back-propagation training to analytically co-optimize the pruning threshold and the weights simultaneously, striking a formally optimal balance between accuracy and computation pruning (an illustrative formulation appears after this list). Additionally, I propose a bit-serial architecture, dubbed LEOPARD, that maximizes the benefits by terminating computation early, before the pruned calculations that follow, all without any approximation.
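To make the dot-product computing model behind BIHIWE concrete, the following is a minimal numerical sketch of the mathematical property it relies on: a dot product over wide operands can be decomposed into many low-bitwidth partial dot products, which small compute units (mixed-signal ones in BIHIWE) could evaluate, with the partial results aggregated by shift-and-add. The bit widths, partition size, and helper names (`bit_partition`, `bit_partitioned_dot`) are illustrative assumptions, not the actual BIHIWE circuitry or interface.

```python
# Illustrative sketch: decomposing an 8-bit dot product into 2-bit partial
# dot products, as a stand-in for low-bitwidth (e.g., analog) compute units.
# Bit widths and helper names are assumptions for illustration only.

def bit_partition(x, total_bits=8, part_bits=2):
    """Split each unsigned element of x into (total_bits // part_bits) slices,
    least-significant slice first: parts[i][k] is the i-th slice of element k."""
    mask = (1 << part_bits) - 1
    return [[(v >> shift) & mask for v in x]
            for shift in range(0, total_bits, part_bits)]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def bit_partitioned_dot(a, b, total_bits=8, part_bits=2):
    """Dot product computed only from low-bitwidth partial dot products,
    aggregated with shifts (the aggregation would be digital in BIHIWE)."""
    a_parts = bit_partition(a, total_bits, part_bits)
    b_parts = bit_partition(b, total_bits, part_bits)
    acc = 0
    for i, ap in enumerate(a_parts):          # slice index of a
        for j, bp in enumerate(b_parts):      # slice index of b
            partial = dot(ap, bp)             # narrow-bitwidth dot product
            acc += partial << (part_bits * (i + j))
    return acc

a = [17, 200, 3, 96]
b = [45, 12, 255, 7]
assert bit_partitioned_dot(a, b) == dot(a, b)
```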
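The bit-level composability of the second contribution can be pictured with similar arithmetic: the same shift-and-add decomposition lets a fixed pool of narrow vector engines be fused into wider dot-product units whose width matches each layer's bitwidth. The engine width, pool size, and `composition` helper below are hypothetical numbers chosen only to show how parallelism trades off against operand width.

```python
# Illustrative bookkeeping for composing narrow vector engines per layer.
# Pool size, engine width, and layer bitwidths are hypothetical.

ENGINE_BITS = 2          # width of one narrow vector engine (assumed)
POOL_SIZE = 64           # total engines available on chip (assumed)

def composition(layer_bits):
    """How many narrow engines fuse into one dot-product unit of width
    `layer_bits`, and how many independent units the pool then provides."""
    engines_per_unit = (layer_bits // ENGINE_BITS) ** 2   # all slice pairs
    independent_units = POOL_SIZE // engines_per_unit
    return engines_per_unit, independent_units

for bits in (2, 4, 8):
    per_unit, units = composition(bits)
    print(f"{bits}-bit layer: {per_unit:2d} engines/unit -> {units:2d} parallel units")
# 2-bit layer:  1 engines/unit -> 64 parallel units
# 4-bit layer:  4 engines/unit -> 16 parallel units
# 8-bit layer: 16 engines/unit ->  4 parallel units
```

The point of the sketch is only the trade-off itself: the more aggressively a layer is quantized, the more independent vector units the same pool of engines yields, while the shared aggregation and operand delivery are amortized across each vector.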
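The interplay between architecture fission and scheduling in Planaria can be pictured with a deliberately simplified sketch: the monolithic accelerator is treated as a pool of identical sub-accelerator "pods", and a greedy heuristic grants each co-located inference task the fewest pods that still meet its latency bound, serving higher-priority and tighter-deadline tasks first. The pod count, latency model, and heuristic are illustrative assumptions, not Planaria's actual scheduling algorithm.

```python
# Simplified illustration of spatially co-locating DNN inference tasks on a
# fissioned accelerator. Pod count, latency model, and the greedy heuristic
# are assumptions for illustration; the real scheduler is more involved.
from dataclasses import dataclass

TOTAL_PODS = 16   # full-fledged sub-accelerators after fission (assumed)

@dataclass
class Task:
    name: str
    base_latency_ms: float   # latency if it ran on the whole accelerator
    deadline_ms: float       # latency bound (SLA)
    priority: int            # higher = more important

def latency(task, pods):
    # Toy model: latency scales inversely with the share of pods granted.
    return task.base_latency_ms * TOTAL_PODS / pods

def schedule(tasks):
    """Greedily grant each task the fewest pods that meet its deadline,
    serving high-priority and tight-deadline tasks first."""
    free = TOTAL_PODS
    allocation = {}
    for t in sorted(tasks, key=lambda t: (-t.priority, t.deadline_ms)):
        for pods in range(1, free + 1):
            if latency(t, pods) <= t.deadline_ms:
                allocation[t.name] = pods
                free -= pods
                break
        else:
            allocation[t.name] = 0   # cannot be co-located right now; queue it
    return allocation

tasks = [Task("resnet50", 4.0, 10.0, priority=1),
         Task("bert-base", 12.0, 40.0, priority=2),
         Task("yolo", 6.0, 25.0, priority=1)]
print(schedule(tasks))   # e.g., {'bert-base': 5, 'resnet50': 7, 'yolo': 4}
```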
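Finally, as an illustration of what a gradient-learned pruning criterion for self-attention can look like, the sketch below uses a per-layer threshold on attention scores that is relaxed into a differentiable gate during training, so the threshold can be co-optimized with the weights by back-propagation. The specific gate, the sharpness parameter, and the regularizer are assumptions for exposition, not the exact formulation developed in this dissertation.

```latex
% Illustrative sketch (not the dissertation's exact formulation):
% a learnable per-layer threshold \theta_\ell on attention scores,
% relaxed into a soft gate during training.
\[
  s_{ij} = \frac{q_i \cdot k_j}{\sqrt{d}}, \qquad
  \tilde{s}_{ij} =
  \begin{cases}
    s_{ij}\,\sigma\!\big(\alpha\,(s_{ij} - \theta_\ell)\big) & \text{training (soft, differentiable)}\\[4pt]
    s_{ij}\,\mathbb{1}\!\left[s_{ij} \ge \theta_\ell\right] & \text{inference (hard pruning)}
  \end{cases}
\]
\[
  \min_{W,\,\theta}\;\; \mathcal{L}_{\text{task}}(W, \theta) \;-\; \lambda \sum_{\ell} \theta_\ell
\]
% Raising \theta_\ell prunes more score/value computations, and \lambda trades
% accuracy against pruned computation. Gradients flow through the soft gate to
% both W and \theta, so thresholds and weights are co-optimized in the same
% back-propagation pass. At inference, scores below the threshold (and their
% corresponding value accumulations) are simply skipped.
```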