Flexibility, Scalability, and Efficiency in Next-Generation Digital Signal Processors
Open Access Publications from the University of California


UCLA Electronic Theses and Dissertations



Despite continued advances in transistor density, the last decade has seen the slowing of Moore's law, rising silicon area costs, and a growing number of dedicated accelerators in modern System-on-Chips (SoCs) and System-in-Packages (SiPs), leading to dark silicon. In search of cost-effective ways to fit more compute on a package, leading chip manufacturers are adopting more flexible hardware designs and integrating them on silicon-interposer-based multi-chip platform technologies.

Flexible chips can reuse hardware resources shared across algorithms, increasing the active utilization of silicon and reducing the required chip area. Additionally, they can accommodate frequent design changes for constantly evolving standards such as 5G, which would otherwise require costly chip re-designs and re-spins. However, existing flexible architectures such as coarse-grain DSPs and CGRAs significantly lag behind their dedicated-accelerator counterparts in throughput and in energy and area efficiency (10×-25×). There is a significant need today for flexible designs that are reusable, deliver high throughput, and are efficient enough for the strict energy and cost requirements of mobile and edge devices, while also ensuring compliance with evolving protocols.

Multi-chip scaling and heterogeneous integration can significantly lower manufacturing costs and time-to-market through higher chip yields and IP-design reuse across multiple process nodes. However, large interposer bump pitches, bulky inter-chip communication links, individual custom timing circuitry, and lower channel bandwidths stand in the way of widespread adoption. Moreover, in energy- and cost-sensitive mobile applications, high channel efficiency coupled with low channel area serve as additional constraints.
To address these challenges, this dissertation presents a flexible, domain-specific, 784-core Universal Digital Signal Processor (UDSP) array targeting DSP applications (such as FIR, IIR, FFT, and vector dot product), achieving a 4.2× energy-efficiency gap and a 6.4× area-efficiency gap relative to ASIC counterparts, with high throughput (1.1 GHz). The UDSP is realized with a coarse-grain, domain-specific core that balances granularity and utilization, interconnected via a network tailored to DSP kernels with the "right" amount of connectivity. In addition, the trade-off between silicon area and compile flexibility is explored for multi-layer sparse switchbox designs, resulting in an area- and time-efficient, hardware-compiler co-optimized switchbox that further enhances design productivity.

To advance multi-chip scaling, this dissertation presents the first functional 2×2 UDSP processor on a two-layer Silicon Interconnect Fabric (Si-IF) with 10-µm-pitch I/O bumps. Using the proposed Streaming Near Range – 10 µm (SNR-10) channel, the inter-chip links achieve 0.38 pJ/bit efficiency at 1.1 GHz and the highest per-layer bandwidth density, at 149 Gbps/mm/layer. To further increase SNR-10 bandwidth without sacrificing technology portability, a 2.1 mW, very-wide-range, 0.0032 mm², fully synthesizable DLL is developed. The DLL uses a ring-oscillator- and counter-based coarse delay line to reduce area and widen the frequency range, an active-preemptive fine-delay-line switching scheme to reduce DNL, and independent dual-edge delays that allow duty-cycle tracking, enabling high-speed DDR links in future revisions of SNR-10.
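The two link figures of merit quoted above can be related to first-order link parameters with simple arithmetic. The sketch below is a hypothetical back-of-the-envelope model, not taken from the dissertation: the link power and the single-row bump arrangement are illustrative assumptions (the reported 149 Gbps/mm/layer implies a denser per-layer bump arrangement than one row).

```python
# Hypothetical first-order figures of merit for an inter-chip link.
# All numeric inputs below are illustrative assumptions, not measured
# values from the dissertation.

def energy_per_bit_pj(link_power_mw: float, data_rate_gbps: float) -> float:
    """Energy efficiency in pJ/bit: power (mW) divided by data rate (Gbps).
    (1 mW / 1 Gbps = 1 pJ/bit, so units cancel directly.)"""
    return link_power_mw / data_rate_gbps

def bandwidth_density_gbps_per_mm(bump_pitch_um: float,
                                  per_wire_rate_gbps: float) -> float:
    """Shoreline bandwidth density for a single row of I/O bumps:
    wires per mm of chip edge times the per-wire data rate."""
    wires_per_mm = 1000.0 / bump_pitch_um
    return wires_per_mm * per_wire_rate_gbps

# Example: an assumed 0.418 mW per wire at 1.1 Gbps/wire (SDR at 1.1 GHz)
# over a 10-um bump pitch.
print(energy_per_bit_pj(0.418, 1.1))           # 0.38 pJ/bit
print(bandwidth_density_gbps_per_mm(10, 1.1))  # 110 Gbps/mm for one bump row
```

Under these assumptions, a single 10-µm-pitch bump row at 1.1 Gbps/wire yields 110 Gbps/mm, which shows why reaching 149 Gbps/mm/layer requires either multiple bump rows per layer or higher per-wire rates such as the DDR operation targeted for future SNR-10 revisions.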
