Scale-Out Packageless Processing
eScholarship: Open Access Publications from the University of California
UCLA Electronic Theses and Dissertations

Abstract

Demand for increased system performance is far outpacing the capability of conventional methods of performance scaling. Traditionally, performance and energy scaling have relied on transistor and silicon scaling. However, developing chips, especially very large ones in advanced technology nodes, is becoming very challenging and costly. Moreover, system performance is often limited by inter-die connections. Today, dies with different functionality are packaged and integrated using printed circuit boards (PCBs). Unlike silicon features, package and PCB features have barely scaled (about 4-5x) over the past few decades, which severely limits the performance and efficiency of processor systems. In addition, next-generation applications driven by artificial intelligence and other data-intensive workloads are creating demand for very large scale-out systems, and traditional scale-out system building and integration methodologies are failing to deliver the performance these applications require. As a result of these trends, future performance, power, and cost improvements cannot come from improvements in transistor technology alone. How, then, do we enable “system scaling”?

In this dissertation, we first show that packages inhibit system scaling: they reduce the potential memory bandwidth of a processor by at least an order of magnitude, the allowable thermal design power (TDP) by up to 70%, and area efficiency by a factor of 5 to 18. We therefore propose packageless processors - processors whose packages have been removed and whose dies are mounted directly on a silicon board using a novel integration technology, Silicon Interconnection Fabric (Si-IF). We show that Si-IF-based packageless processors outperform their packaged counterparts by up to 58% (16% on average), 136% (103% on average), and 295% (80% on average) due to increased memory bandwidth, increased allowable TDP, and reduced area, respectively. We also extend the concept of packageless processing to the entire processor and memory system, reducing the area footprint by up to 76%. To guide technology direction for dielet integration substrate technologies, we developed a die-to-die interconnect pathfinding tool that explores the effects of physical trade-offs such as bump pitch, wire pitch, and I/O ESD capacitance. We show that continued reduction of bump and wire pitch below 10 um yields diminishing returns for interconnect performance, and that we need techniques and technologies that minimize reliance on large ESD structures in chiplet I/Os, since ESD capacitance starts to dominate the performance and energy cost of these die-to-die interconnect links. Next, we show that fine-pitch chiplet integration technologies allow large SoCs to be disintegrated into chiplets with only a minimal performance penalty. This opens up the opportunity to build a chiplet ecosystem, in which application-optimized systems are built by selecting a subset of chiplets from a chiplet pool. Such an ecosystem, however, requires finding a suitable minimal set of chiplets to build in order to target a variety of workloads efficiently.
To that end, we developed the first chiplet selection framework targeting a large variety of applications. We show that application-specific system customization can improve energy-delay product (EDP) by up to 35% and, when the total cost of design and manufacturing is considered, can reduce cost by up to 72% relative to SoCs.
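To make the chiplet-selection problem concrete, the following is a minimal illustrative sketch, not the dissertation's actual framework: given a pool of chiplets with (hypothetical) per-workload EDP numbers, it greedily picks a small set of chiplets so that every workload is covered by at least one chiplet within an EDP budget. All names and values here are invented for illustration; the real framework also accounts for design and manufacturing cost.

```python
# edp[c][w]: normalized energy-delay product of workload w on chiplet c.
# All chiplet names, workloads, and numbers are hypothetical.
edp = {
    "big_core":    {"compile": 1.0, "dnn": 3.0, "graph": 1.2},
    "simd_accel":  {"compile": 4.0, "dnn": 0.6, "graph": 2.5},
    "graph_accel": {"compile": 3.5, "dnn": 2.0, "graph": 0.7},
}
BUDGET = 1.5  # a workload is "covered" if some selected chiplet meets this EDP


def select_chiplets(edp, budget):
    """Greedy set-cover heuristic: repeatedly add the chiplet that covers
    the most still-uncovered workloads within the EDP budget."""
    workloads = {w for per_w in edp.values() for w in per_w}
    covered, chosen = set(), []
    while covered != workloads:
        best = max(edp, key=lambda c: sum(
            1 for w, v in edp[c].items() if v <= budget and w not in covered))
        gain = {w for w, v in edp[best].items() if v <= budget} - covered
        if not gain:
            break  # remaining workloads cannot be covered within this budget
        chosen.append(best)
        covered |= gain
    return chosen


print(select_chiplets(edp, BUDGET))
```

With these toy numbers, "big_core" covers the compile and graph workloads and "simd_accel" covers dnn, so two chiplets suffice; a real selection framework would trade this coverage off against per-chiplet NRE and manufacturing cost.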

Part 2 of the dissertation focuses on scale-out processing systems. To target such systems, we propose chiplet-based waferscale processors that dramatically reduce communication overheads. The Si-IF technology can be used to build scale-out processors up to the size of an entire wafer. However, building such a large consolidated waferscale system has its own challenges. Using a waferscale GPU as a case study, we show that while a 300 mm wafer can house about 100 GPU modules (GPMs), only a much scaled-down GPU architecture with about 40 GPMs can be built once physical concerns are considered. We analyze the design space of the power-delivery network and cooling, along with the trade-offs between yield and inter-GPM network topology, and propose an optimized waferscale GPU architecture. We also optimize thread scheduling and data placement policies. Overall, our simulations show that an optimized waferscale architecture can provide up to 19x speedup over traditionally integrated systems. We then architected and designed a 14,336-core shared-memory waferscale system in order to understand the design challenges of waferscale processors. Several aspects of the design were built from the ground up because of the scale of the system: power delivery and on-chip regulation methods, reliable waferscale clock distribution, fault-tolerant waferscale network design, chiplet and waferscale system test mechanisms, and multiple physical and architectural techniques to enhance system yield. The chiplets were taped out in the TSMC N40-LP process, and a smaller prototype system has been functionally verified.

Next, we focused on understanding the scalability characteristics of deep learning (DL) training applications and on exploring the cross-stack impact of hardware-software-technology co-design at scale.
With the aid of an optimal operation-to-device placement tool, we propose a framework that determines when to combine model parallelism with data parallelism, rather than using data parallelism alone, in order to minimize end-to-end training time. We also developed a system-technology co-optimization tool that explores the cross-stack impact of technology scaling, model scaling, and architectural innovations on end-to-end DL training time. Using this tool, we can perform rapid yet accurate design space exploration and find optimal architectures under given logic, memory, and inter-chip interconnect technology parameters.
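The data-versus-hybrid-parallelism decision can be illustrated with a back-of-the-envelope step-time model. This is a hedged sketch under simple assumptions, not the dissertation's model: per-device compute scales as work/N, data parallelism pays a ring all-reduce on the gradients every step, and model parallelism additionally exchanges activations across the model-parallel cut. All function names and numbers below are illustrative.

```python
def allreduce_time(grad_bytes, n_dev, link_bw):
    """Ring all-reduce moves ~2*(n-1)/n of the gradient bytes per device."""
    if n_dev <= 1:
        return 0.0
    return 2 * (n_dev - 1) / n_dev * grad_bytes / link_bw


def data_parallel_step(compute_s, grad_bytes, n_dev, link_bw):
    """Pure data parallelism: split compute N ways, all-reduce full gradients."""
    return compute_s / n_dev + allreduce_time(grad_bytes, n_dev, link_bw)


def hybrid_step(compute_s, grad_bytes, n_dev, mp, link_bw, act_bytes):
    """mp-way model parallelism inside each replica, data parallelism across
    the n_dev // mp replicas; activations cross the model-parallel cut, and
    each device all-reduces only its 1/mp shard of the gradients."""
    dp = n_dev // mp
    t_compute = compute_s / n_dev + act_bytes / link_bw
    return t_compute + allreduce_time(grad_bytes / mp, dp, link_bw)


# Hypothetical large model: 8 GB of gradients, 100 GB/s links, 64 devices.
t_dp = data_parallel_step(1.0, 8e9, 64, 100e9)
t_hybrid = hybrid_step(1.0, 8e9, 64, 8, 100e9, 1e8)
print(t_dp, t_hybrid)
```

Under these assumptions, the gradient all-reduce dominates pure data parallelism for large models, so the hybrid configuration wins; with a small gradient volume the comparison flips, which is exactly the kind of crossover a placement-aware framework would identify per model and system.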

Together, the techniques and methodologies developed in this dissertation lay the groundwork for a revolutionary new way of thinking about system scaling. Packageless processing and scale-out waferscale architectures can indeed provide the orders-of-magnitude improvements in performance and energy efficiency required by next-generation applications. Moreover, the cross-stack pathfinding tools provide rapid-assessment frameworks for understanding bottlenecks across different levels of the system and help guide optimal technology decisions for processing systems.
