Methodologies for Evaluating Memory Models in gem5
Recent trends in computer applications and the rate of data generation worldwide have created a huge demand for high-performance computing. Architecture simulators play a key role in improving the performance of computer hardware. Simulators should be validated against real hardware to prevent faulty data from misleading researchers and engineers. Most cycle-accurate simulators have been through such a process; however, full-system simulators like gem5 have not been subject to such evaluations. Moreover, to speed up simulation, gem5 is designed as an event-based simulator that abstracts away transient details of the components in a system, which makes it susceptible to abstraction errors.

In this work, we propose methodologies and tools for evaluating the accuracy of gem5's models of memory-subsystem components. Rather than inspecting the microarchitectural statistics of each component in detail, we consider the full-system effect of each component. In our methodology, we propose using synthetic traffic to factor out any inaccuracy that might originate from processor models. We conclude that gem5's memory-subsystem models do not exhibit any unexpected behavior. We also observe a 2x difference between the latency readings from gem5 and DRAMSim3; we believe this difference is caused by the level of abstraction in gem5's memory controller design.

Moreover, we model a complete memory subsystem using publicly available information on the Intel Skylake architecture and use the RandomAccess benchmark to evaluate the accuracy of the memory subsystem. To implement this memory subsystem in gem5, we use an instance of the Ruby cache model that implements a MOESI coherence protocol for a two-level hierarchy. We use the same size, latency, and associativity for the L1 cache as the real hardware.
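As an illustration, the synthetic-traffic methodology can be sketched as a gem5 configuration that replaces the CPU with a traffic generator. The component names below (TestBoard, RandomGenerator, SingleChannelDDR4_2400, NoCache) follow gem5's standard library, but the parameter values are illustrative assumptions, not the exact setup used in this work; the script must be launched with the gem5 binary rather than plain Python.

```python
# Sketch: drive the memory system with synthetic random traffic instead
# of a CPU model, so memory-model inaccuracies can be measured
# independently of any processor-model inaccuracies.
from gem5.components.boards.test_board import TestBoard
from gem5.components.cachehierarchies.classic.no_cache import NoCache
from gem5.components.memory.single_channel import SingleChannelDDR4_2400
from gem5.components.processors.random_generator import RandomGenerator
from gem5.simulate.simulator import Simulator

# Illustrative parameters: random reads over a 4 MiB region.
generator = RandomGenerator(
    duration="250us",   # how long to generate traffic
    rate="40GB/s",      # requested injection bandwidth
    max_addr=2**22,     # restrict accesses to a 4 MiB region
    rd_perc=100,        # read-only traffic
)

board = TestBoard(
    clk_freq="3GHz",
    generator=generator,
    memory=SingleChannelDDR4_2400(size="1GiB"),
    cache_hierarchy=NoCache(),  # requests go straight to DRAM
)

Simulator(board=board).run()
```

Latency and bandwidth statistics can then be read from gem5's stats output and compared against a detailed model such as DRAMSim3 fed with the same traffic.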
Due to the lack of an L3 cache in our cache model, we use an L2 cache to represent both the L2 and L3 caches. For this cache, we use the same size and associativity as the L3 in the real hardware, and a weighted average of the L2 and L3 access latencies. We report a 10% error in our GUPS measurements. The difference is partially caused by the differences in the configuration of the caches. Moreover, gem5's lack of support for some microarchitectural components in the memory subsystem, such as per-bank or per-rank memory queue structures, adds to the difference between the readings from simulation and real hardware. We believe studies that do not target changes in the memory subsystem could use this configuration as an evaluated setup for their experiments. Further fine-tuning of the configuration could result in an even more accurate representation of the real hardware.
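The weighted-average latency for the merged L2/L3 cache can be illustrated with a small sketch. The function name is ours, and the latencies and access weights below are hypothetical placeholders, not measured Skylake values.

```python
def weighted_avg_latency(l2_latency, l3_latency, l2_accesses, l3_accesses):
    """Access-weighted average of L2 and L3 hit latencies (in cycles)."""
    total = l2_accesses + l3_accesses
    return (l2_latency * l2_accesses + l3_latency * l3_accesses) / total

# Hypothetical example: if 3 of every 4 accesses would hit in L2 at
# 14 cycles and the rest in L3 at 44 cycles, the merged cache is
# configured with the blended latency:
print(weighted_avg_latency(14, 44, 3, 1))  # -> 21.5
```

The resulting value is what a single cache level would need so that, on average, accesses see the same latency as the original two levels under the assumed hit distribution.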