Search

Scholarly Works (3 results)

Sort By:

Thesis
Peer Reviewed

Improving FPGA Simulation Capacity with Automatic Resource Multi-Threading

UC Berkeley Electronic Theses and Dissertations (2021)

Modern system-on-a-chip (SoC) development is a highly complex process that spans multiple levels of design abstraction and cross-cutting requirements. With a rapidly evolving ecosystem of domain-specific accelerators and wide design spaces to search, the ability to rapidly evaluate potential chip designs has never been more important. In the modern chip- design landscape, field-programmable gate arrays (FPGAs) play a critical role in delivering this simulation capability due to their unique ability to emulate concrete, register-transfer level (RTL) designs at speeds sufficient to run real applications spanning trillions of cycles of simulated target-design execution. However, the use of FPGAs for logic emulation presents challenges, including the perennial difficulty of effectively mapping large target designs to the finite resources of a given FPGA platform.

To help address this challenge, this dissertation presents a novel approach to manage these limitations through the use of automatic resource-efficiency optimizations that reduce the number of FPGA resources required to faithfully implement cycle-accurate emulators of large chips, all without requiring the tedious manual effort and complexity of previous FPGA- optimized simulation techniques. By substituting target-design memories with logic-intensive read and write ports for resource-efficient, cycle-accurate models that serially access FPGA memory primitives, Golden Gate simulators can avoid the disproportionate impact of FPGA-hostile memory design patterns on simulators of high-performance processor cores. Drawing inspiration from software simulators and specialized emulators, where common code may be repeatedly executed to model an arbitrary number of copies of a given block, I also introduce an automatic instance-threading optimization, through which the logic resources required to simulate a given module may be shared across multiple instances, radically reducing their collective footprint.

To support the use of these optimizations across a broad array of user designs, they are integrated as contributions to Golden Gate, an extensible compiler that translates RTL designs into cycle-accurate FPGA simulators as part of the open-source FireSim FPGA simulation framework. By structuring simulators as modular dataflow networks, Golden Gate provides the flexibility to compose the two optimizations along with the ability to com- bine them with software co-simulation or other advanced simulation features. To evaluate the performance of the optimizations and to validate the optimizing compiler stack, these techniques are applied to two input designs: a general-purpose SoC with multiple out-of- order cores and a domain-specific accelerator with multiple systolic array co-processors. In each case, finite programmable logic resources limit the maximum number of cores–and therefore the size of the system–that can effectively be simulated on a simulation platform consisting of cloud-hosted Xilinx VU9P FPGAs. However, by enabling optimizations in Golden Gate through simple compiler directives, the same FPGA platform was able to support configurations of each system with an eight-fold increase in core count relative to the baseline, providing the ability to simulate sixteen out-of-order cores or eight accelerator cores at high speed, with deterministic, cycle-accurate results. Ultimately, this significant increase in per-FPGA capability broadens the utility of commodity FPGAs in simulating ever-growing chips, while the convenience of automatic compiler optimization helps support designer productivity in a rapidly accelerating hardware ecosystem.

Article
Peer Reviewed

Golden Gate: Bridging The Resource-Efficiency Gap Between ASICs and FPGA Prototypes

UC Berkeley Previously Published Works (2019)

We present Golden Gate, an FPGA-based simulation tool that decouples the timing of an FPGA host platform from that of the target RTL design. In contrast to previous work in static time-multiplexing of FPGA resources, Golden Gate employs the Latency-Insensitive Bounded Dataflow Network (LI-BDN) formalism to decompose the simulator into subcomponents, each of which may be independently and automatically optimized. This structure allows Golden Gate to support a broad class of optimizations that improve resource utilization by implementing FPGA-hostile structures over multiple cycles, while the LI-BDN formalism ensures that the simulator still produces bit-and cycle-exact results. To verify that these optimizations are implemented correctly, we also present LIME, a model-checking tool that provides a push-button flow for checking whether optimized subcomponents adhere to an associated correctness specification, while also guaranteeing forward progress. Finally, we use Golden Gate to generate a cycle-exact simulator of a multi-core SoC, where we reduce LUT utilization by up to 26% by coercing multi-ported, combinationally read memories into simulation models backed by time-multiplexed block RAMs, enabling us to simulate 50% more cores on a single FPGA.

Article
Peer Reviewed

Golden Gate: Bridging The Resource-Efficiency Gap Between ASICs and FPGA Prototypes

UC Berkeley Previously Published Works (2019)

We present Golden Gate, an FPGA-based simulation tool that decouples the timing of an FPGA host platform from that of the target RTL design. In contrast to previous work in static time-multiplexing of FPGA resources, Golden Gate employs the Latency-Insensitive Bounded Dataflow Network (LI-BDN) formalism to decompose the simulator into subcomponents, each of which may be independently and automatically optimized. This structure allows Golden Gate to support a broad class of optimizations that improve resource utilization by implementing FPGA-hostile structures over multiple cycles, while the LI-BDN formalism ensures that the simulator still produces bit- and cycle-exact results. To verify that these optimizations are implemented correctly, we also present LIME, a model-checking tool that provides a push-button flow for checking whether optimized subcomponents adhere to an associated correctness specification, while also guaranteeing forward progress. Finally, we use Golden Gate to generate a cycle-exact simulator of a multi-core SoC, where we reduce LUT utilization by up to 26% by coercing multi-ported, combinationally read memories into simulation models backed by time-multiplexed block RAMs, enabling us to simulate 50% more cores on a single FPGA.