Modern system-on-a-chip (SoC) development is a highly complex process that spans multiple levels of design abstraction and cross-cutting requirements. With a rapidly evolving ecosystem of domain-specific accelerators and wide design spaces to search, the ability to quickly evaluate potential chip designs has never been more important. In the modern chip-design landscape, field-programmable gate arrays (FPGAs) play a critical role in delivering this capability because of their unique ability to emulate concrete, register-transfer-level (RTL) designs at speeds sufficient to run real applications spanning trillions of cycles of simulated target-design execution. However, the use of FPGAs for logic emulation presents challenges, including the perennial difficulty of effectively mapping large target designs to the finite resources of a given FPGA platform.
To help address this challenge, this dissertation presents a novel approach based on automatic resource-efficiency optimizations that reduce the number of FPGA resources required to faithfully implement cycle-accurate emulators of large chips, all without requiring the tedious manual effort and complexity of previous FPGA-optimized simulation techniques. By replacing target-design memories that have logic-intensive read and write ports with resource-efficient, cycle-accurate models that serially access FPGA memory primitives, Golden Gate simulators can avoid the disproportionate impact of FPGA-hostile memory design patterns on simulators of high-performance processor cores. Drawing inspiration from software simulators and specialized emulators, where common code may be repeatedly executed to model an arbitrary number of copies of a given block, I also introduce an automatic instance-threading optimization, through which the logic resources required to simulate a given module may be shared across multiple instances, radically reducing their collective footprint.
To support the use of these optimizations across a broad array of user designs, they are integrated as contributions to Golden Gate, an extensible compiler that translates RTL designs into cycle-accurate FPGA simulators as part of the open-source FireSim FPGA simulation framework. By structuring simulators as modular dataflow networks, Golden Gate provides the flexibility to compose the two optimizations along with the ability to combine them with software co-simulation or other advanced simulation features. To evaluate the performance of the optimizations and to validate the optimizing compiler stack, these techniques are applied to two input designs: a general-purpose SoC with multiple out-of-order cores and a domain-specific accelerator with multiple systolic array co-processors. In each case, finite programmable logic resources limit the maximum number of cores, and therefore the size of the system, that can effectively be simulated on a simulation platform consisting of cloud-hosted Xilinx VU9P FPGAs. However, by enabling optimizations in Golden Gate through simple compiler directives, the same FPGA platform was able to support configurations of each system with an eight-fold increase in core count relative to the baseline, providing the ability to simulate sixteen out-of-order cores or eight accelerator cores at high speed, with deterministic, cycle-accurate results. Ultimately, this significant increase in per-FPGA capability broadens the utility of commodity FPGAs in simulating ever-growing chips, while the convenience of automatic compiler optimization helps support designer productivity in a rapidly accelerating hardware ecosystem.