This thesis presents fast and accurate RTL simulation methodologies for performance, power, and energy evaluation as well as verification and debugging using FPGAs in the hardware/software co-design flow.
Cycle-level microarchitectural software simulation is the bottleneck of the hardware/software co-design cycle due to its slow speed and the difficulty of simulator validation. While simulation sampling can ameliorate some of these challenges, we show that it is often insufficient for rigorous design evaluations. To circumvent the limitations of software-based simulation and sampling, this thesis presents MIDAS v1.0, which automatically generates an FPGA-accelerated RTL simulator, as an instance of FAME1, from any RTL design. These simulators are not only up to three orders of magnitude faster than existing microarchitectural software simulators, but also truly cycle-accurate, because the same RTL is used to build the silicon implementation.
The increasing complexity of modern hardware design makes verification challenging, and verification often dominates design costs. While formal verification and unit-level tests can improve the confidence in some blocks or some aspects of a design, dynamic verification using simulators or emulators with real-world applications is usually the only plausible strategy for system-level RTL verification. Therefore, this thesis presents DESSERT, an effective simulation-based RTL verification methodology using FPGAs. The target RTL design is automatically transformed and instrumented to allow deterministic simulation on the FPGA with initialization and state snapshot capture. Assert statements, which are present in RTL for error checking in software simulation, are automatically synthesized for quick error checking on the FPGA, while print statements in the RTL design are also automatically transformed to generate logs from the FPGA for more exhaustive error checking. To rapidly provide waveforms for debugging, two parallel simulations are run spaced apart in simulation time to support capture and replay of state snapshots immediately before an error.
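The two-simulation scheme for snapshot capture and replay can be illustrated with a minimal software sketch. It assumes the design is deterministic and can be modeled as a pure step function. The names `run_with_trailing_snapshot` and `replay`, the integer state, and the interleaving of both simulations in a single loop are hypothetical choices for illustration, not the DESSERT implementation.

```python
def run_with_trailing_snapshot(init, step, check, lag, max_cycles):
    """Run a leading and a trailing copy of the same deterministic design.

    The trailing copy stays `lag` cycles behind the leading one, so when
    `check` fails on the leading copy, the trailing state is a snapshot
    taken shortly before the error.
    Returns (failing_cycle, snapshot_cycle, snapshot), or None if no failure.
    """
    lead = trail = init
    for cycle in range(max_cycles):
        if not check(lead):
            # The trailing copy has advanced max(0, cycle - lag) cycles.
            return cycle, max(0, cycle - lag), trail
        lead = step(lead)
        if cycle >= lag:
            trail = step(trail)
    return None


def replay(snapshot, step, cycles):
    """Re-execute from a snapshot, recording every state (a stand-in for a waveform)."""
    trace = [snapshot]
    for _ in range(cycles):
        trace.append(step(trace[-1]))
    return trace


if __name__ == "__main__":
    # Hypothetical toy design: a counter whose assertion fails at value 1000.
    result = run_with_trailing_snapshot(
        init=0, step=lambda s: s + 1, check=lambda s: s != 1000,
        lag=64, max_cycles=10_000)
    fail_cycle, snap_cycle, snapshot = result
    waveform = replay(snapshot, lambda s: s + 1, fail_cycle - snap_cycle)
    print(fail_cycle, snap_cycle, len(waveform))
```

Because the design is deterministic, replaying from the trailing snapshot reproduces the same states the leading simulation saw, so only the last `lag` cycles need detailed waveform capture.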
Energy efficiency is a primary metric for computing systems, requiring designers to evaluate energy efficiency quickly and accurately throughout the design process. Prior abstract energy models are only accurate for designs closely matching the template for which the model was constructed and validated. Any energy model must be calibrated to a ground truth, usually a real physical system or a gate-level energy simulation. Validation of energy models is difficult for three reasons: only a few design points will ever be fabricated as real systems, real systems typically lack adequate energy instrumentation, and gate-level simulation of proposed designs is extremely slow.
For fast and accurate power and energy evaluation of RTL, this thesis first presents Strober, a sample-based energy simulation methodology. Strober uses an FPGA to simultaneously simulate the performance of an RTL design and to collect samples containing exact RTL state snapshots. Each snapshot is then replayed in RTL/gate-level simulation, resulting in a workload-specific average power estimate with its confidence interval. For arbitrary RTL and workloads, Strober achieves orders-of-magnitude speedup over commercial CAD tools, and its average power estimates are statistically bounded: they fall within a small error of the true value with high confidence.
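The statistical side of sample-based estimation can be sketched as follows. This is a minimal illustration under stated assumptions: snapshots are drawn with reservoir sampling, and the confidence interval uses a normal approximation to the sample mean. The function names `reservoir_sample` and `mean_with_confidence`, and the synthetic power values in the demo, are hypothetical.

```python
import math
import random
import statistics


def reservoir_sample(stream, n, rng):
    """Keep a uniform random sample of n items from a stream of unknown length."""
    sample = []
    for i, item in enumerate(stream):
        if i < n:
            sample.append(item)
        else:
            # Replace a kept item with probability n / (i + 1).
            j = rng.randrange(i + 1)
            if j < n:
                sample[j] = item
    return sample


def mean_with_confidence(values, confidence=0.99):
    """Return (mean, half_width) of a normal-approximation confidence interval."""
    n = len(values)
    mean = statistics.fmean(values)
    z = statistics.NormalDist().inv_cdf(0.5 + confidence / 2)
    half_width = z * statistics.stdev(values) / math.sqrt(n)
    return mean, half_width


if __name__ == "__main__":
    rng = random.Random(0)
    # Hypothetical per-snapshot power values (in watts), standing in for
    # the results of replaying each snapshot in gate-level simulation.
    all_windows = [1.0 + 0.2 * rng.random() for _ in range(100_000)]
    replayed = reservoir_sample(all_windows, 30, rng)
    mean, half = mean_with_confidence(replayed, confidence=0.99)
    print(f"average power = {mean:.3f} W +/- {half:.3f} W")
```

The key design point is that only the sampled snapshots need slow gate-level replay, while the interval width shrinks with the square root of the sample count.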
Runtime power modeling is also necessary for dynamic power/thermal optimizations. This thesis finally presents Simmani, an activity-based runtime power modeling methodology that automatically identifies the signals that best explain the runtime power dissipation of any RTL design. The toggle pattern matrix, in which each signal is represented as a high-dimensional point, is constructed from the VCD dumps of a small training set. By clustering signals that show similar toggle patterns, an optimal number of signals is automatically selected, and a design-specific but workload-independent activity-based power model is then trained with regression against power traces obtained from industry-standard CAD tools. Simmani also automatically instruments the target design with activity counters to collect activity statistics from FPGA-based simulation, enabling runtime power analysis of real-world workloads at speed.
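The signal-selection step can be sketched in a few lines. This is a minimal illustration, assuming per-cycle signal values have already been extracted from a VCD dump, that plain k-means with a fixed k stands in for the clustering, and that the signal nearest each centroid represents its cluster. The names `toggle_counts`, `kmeans`, and `select_signals` are hypothetical, and the regression step is omitted.

```python
def toggle_counts(values, window):
    """Count value changes of one signal in consecutive windows of `window` cycles."""
    toggles = [int(a != b) for a, b in zip(values, values[1:])]
    return [sum(toggles[i:i + window]) for i in range(0, len(toggles), window)]


def dist2(p, q):
    """Squared Euclidean distance between two toggle-pattern points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def nearest(point, centroids):
    """Index of the centroid closest to `point`."""
    return min(range(len(centroids)), key=lambda c: dist2(point, centroids[c]))


def kmeans(points, k, iters=20):
    """Plain k-means; farthest-point initialization keeps the sketch deterministic."""
    centroids = [list(points[0])]
    while len(centroids) < k:
        far = max(points, key=lambda p: min(dist2(p, c) for c in centroids))
        centroids.append(list(far))
    for _ in range(iters):
        labels = [nearest(p, centroids) for p in points]
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return [nearest(p, centroids) for p in points], centroids


def select_signals(matrix, k):
    """Pick one representative signal per cluster of similar toggle patterns.

    matrix maps a signal name to its row of the toggle pattern matrix.
    """
    names = sorted(matrix)
    points = [matrix[n] for n in names]
    labels, centroids = kmeans(points, k)
    chosen = []
    for c in range(k):
        members = [i for i, l in enumerate(labels) if l == c]
        if members:
            best = min(members, key=lambda i: dist2(points[i], centroids[c]))
            chosen.append(names[best])
    return chosen


if __name__ == "__main__":
    busy = [0, 1] * 8 + [0]                      # toggles every cycle
    late = [0] * 9 + [1, 0, 1, 0, 1, 0, 1, 0]    # toggles only in the second half
    traces = {"a": busy, "b": busy, "c": late, "d": late}
    matrix = {name: toggle_counts(v, window=4) for name, v in traces.items()}
    print(select_signals(matrix, k=2))
```

Signals with near-identical toggle patterns carry redundant information for a power model, so keeping one representative per cluster reduces the number of activity counters that must be instrumented.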