As the complexity of processors increases, it becomes harder for
designers to understand the non-trivial and often non-intuitive
interactions among the internal structures of the micro-architecture. Understanding
these interactions is important because it helps pinpoint bottlenecks, enabling
designers to reason about sources of performance loss and improve their next
generation of processors. To understand these interactions in
current and, more importantly, future-generation designs, designers make
heavy use of detailed computer architecture simulation. These simulators model
the behavior of the processor on a per-cycle basis, allowing designers to look
at very detailed trade-offs. Building and maintaining these simulators is a
large and complicated task. In addition, the recent trend toward
micro-architectures with multiple cores on the same chip brings new challenges
that affect the way simulation results should be compared. This dissertation
focuses on techniques to help build and maintain simulators, as well as
techniques to improve the way architects evaluate design choices using
simulation.
Existing user-level simulators require hand-coded emulation of
every system effect (e.g., system call,
interrupt, DMA transfer) that can impact the application's execution.
Developing such an emulator for a given operating system is a tedious exercise,
and maintaining it to support newer versions of that
operating system can be costly. Furthermore, porting the emulator to a completely different
operating system might require rebuilding it from scratch. The first
contribution of this dissertation is a technique to automatically capture the
system effects on an application. The system effects are captured in logs and
then used to guide architecture simulation. By using the proposed technique,
the complexity of implementing and maintaining user-level simulators is greatly
reduced. In addition, the technique guarantees deterministic simulation on
uni-processor systems.
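As a rough illustration of the log-driven idea, the following Python sketch replays recorded system effects instead of emulating them. Everything in it (the SystemEffect record, the ToyCPU class, the log layout, and the use of an instruction count as the trigger) is a hypothetical simplification chosen for illustration, not the log format or implementation used in the dissertation.

```python
from dataclasses import dataclass, field


@dataclass
class SystemEffect:
    """One logged system effect (hypothetical record layout)."""
    instr_count: int  # user instruction count after which the effect is applied
    reg_updates: dict = field(default_factory=dict)  # register name -> new value
    mem_updates: dict = field(default_factory=dict)  # address -> new byte value


class ToyCPU:
    """Minimal user-level machine state: registers, memory, instruction count."""

    def __init__(self):
        self.regs = {}
        self.mem = {}
        self.instr_count = 0

    def step(self):
        # Stand-in for simulating one user-level instruction.
        self.instr_count += 1


def replay(cpu, log, total_instrs):
    """Simulate total_instrs user instructions, applying logged effects.

    No system call, interrupt, or DMA transfer is emulated here: the
    register and memory changes recorded in the log are simply applied
    once the matching instruction count is reached.
    """
    pending = sorted(log, key=lambda e: e.instr_count)
    i = 0
    while cpu.instr_count < total_instrs:
        cpu.step()
        while i < len(pending) and pending[i].instr_count == cpu.instr_count:
            cpu.regs.update(pending[i].reg_updates)
            cpu.mem.update(pending[i].mem_updates)
            i += 1
    return cpu
```

With a log such as [SystemEffect(3, {"rax": 42}, {0x1000: 7})], every call to replay(ToyCPU(), log, 5) ends in the same register and memory state, which is the sense in which replaying from a log makes a uni-processor run repeatable in this toy model.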
As multi-core processors become mainstream,
techniques for efficient simulation of multi-threaded workloads are
needed. Simulation of multi-threaded workloads on multi-core systems suffers from
non-determinism across runs with different architecture configurations. If the
execution paths of two simulation runs of the same benchmark, with the
same input, differ too much, the simulation results cannot be used to compare
the configurations. The other contributions of this dissertation are
techniques to efficiently collect simulation checkpoints for multi-threaded
workloads. They extend the earlier technique for efficiently collecting logs for
uni-processor simulation. Using these checkpoints, multi-threaded simulation on
multi-core systems becomes deterministic. Enforcing determinism, however, introduces
stalls that would not naturally occur in execution. This dissertation
proposes techniques that allow one to accurately compare performance across
architecture configurations in the presence of these stalls.
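To make the origin of such stalls concrete, here is a toy Python model, given only as a sketch under stated assumptions. It assumes a recorded global order of synchronization events and per-thread work durations between those events; the function name replay_with_order and both data layouts are hypothetical. It does not represent the dissertation's checkpoint format or its technique for comparing configurations.

```python
def replay_with_order(order, work):
    """Replay synchronization events in a recorded order and count stalls.

    order: list of thread ids, one per synchronization event, in the
           order recorded during the original run (assumed layout).
    work:  dict mapping thread id -> list of cycle counts the thread
           needs before each of its events under the simulated
           configuration (assumed layout).
    Returns a dict mapping thread id -> total stall cycles.
    """
    clock = {tid: 0 for tid in work}      # per-thread local time
    stalls = {tid: 0 for tid in work}     # cycles spent waiting for a turn
    next_idx = {tid: 0 for tid in work}   # next work interval per thread
    last_grant = 0                        # time of the previous event

    for tid in order:
        # Cycle at which the thread is ready to perform its event.
        arrival = clock[tid] + work[tid][next_idx[tid]]
        # The event may not happen before its predecessor in the log.
        grant = max(arrival, last_grant)
        stalls[tid] += grant - arrival
        clock[tid] = grant
        next_idx[tid] += 1
        last_grant = grant
    return stalls
```

For the recorded order [0, 1, 0], a configuration with work {0: [10, 10], 1: [5]} makes thread 1 wait 5 cycles for its turn, whereas a configuration with work {0: [4, 4], 1: [5]} produces no stalls. The same log thus induces different artificial waiting in different configurations, which is why a direct comparison of raw cycle counts can be misleading.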
Pre-2018 CSE ID: CS2007-0907