HPC application address stream compression, replay and scaling
- Author(s): Olschanowsky, Catherine Rose Mills; et al.
As the capabilities of high performance computing (HPC) resources have grown over recent decades, a performance gap has developed and widened between the processor and memory. Processor speeds have improved according to Moore's law, while memory bandwidth has lagged behind. The performance bottleneck created by this gap, termed the "von Neumann bottleneck," has been the driving force behind the development of modern memory subsystems. Many advances have been aimed at hiding this memory bottleneck; multi-level cache structures with a variety of implementation policies have been introduced. Memory subsystems have become very complex, and the effectiveness of their structures and policies varies according to the behavior of the application running on the resource. Memory simulation studies aid in the design of memory subsystems and in acquisition decisions. During a typical acquisition, candidate resources are evaluated to determine their appropriateness for a pre-defined workload. Simulation-aided models provide performance predictions when the hardware is not available for full testing ahead of purchase. However, the address streams of full applications may be too large for direct use, complicating memory subsystem simulation. Memory address streams are extremely large: they can grow at a rate of over 2.6 TB/hour per core, and HPC workloads contain applications that run for days across hundreds of processors, generating address streams that are intractable to handle. Yet these streams contain a wealth of information about application behavior that is largely inaccessible. This work describes a novel compression technique designed specifically to make the information within HPC application address streams accessible and manageable. The method has several advantages over previous approaches: extremely high compression rates, low overhead, and a human-readable format.
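The abstract does not describe the compression scheme itself, but the highly regular, loop-driven access patterns of HPC codes suggest why such extreme ratios and a human-readable format are plausible. As a minimal, hypothetical sketch (not the authors' actual technique), a stride run-length encoder collapses each regular access sequence into a compact (base, stride, count) triple that can be replayed exactly:

```python
def compress(addrs):
    """Collapse an address stream into (base, stride, count) runs.

    A run with stride None holds a single address seen so far; once a
    second address arrives, the stride is fixed and the run extends as
    long as addresses continue at that stride.
    """
    runs = []
    for a in addrs:
        if runs:
            base, stride, count = runs[-1]
            if stride is None:
                # Second address of a run fixes the stride.
                runs[-1] = (base, a - base, 2)
                continue
            if a == base + stride * count:
                # Address continues the current strided run.
                runs[-1] = (base, stride, count + 1)
                continue
        runs.append((a, None, 1))
    return runs

def replay(runs):
    """Regenerate the original address stream from the runs."""
    for base, stride, count in runs:
        if stride is None:
            yield base
        else:
            for i in range(count):
                yield base + i * stride
```

For example, a loop sweeping a double array produces addresses 0, 8, 16, ..., which compress to the single triple (0, 8, count) regardless of trip count; replaying the runs reproduces the stream exactly, which is why such a format can be both tiny on disk and fast to replay.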
These attributes of the compression technique enable further, previously impractical, studies. High compression ratios are a necessity for application address streams: the streams are very large, making them challenging to collect and store, and any simulation experiment performed using a stored stream is limited by disk speeds, since there is no other plausible place to keep and retrieve such volumes of data. The compression technique presented has demonstrated compression ratios in the hundreds of thousands. The resulting files are small enough to be emailed between collaborators, and the format can be replayed at least as fast as disk speeds. The collection overhead for an address stream must be low, because collection takes place on an HPC resource and HPC resource time is costly. This compression technique has an unsampled average slowdown of 90X, an improvement over the state of the art. The compressed address stream profiles are human readable, an attribute that enables new and interesting uses of application address streams: for example, it is possible to evaluate hypothetical code optimizations using simulation or other metrics rather than actually implementing them. Strong scaling analysis of memory behavior is historically challenging; high-level metrics such as execution time and cache miss rates do not lend themselves well to strong scaling studies because they hide the true complexity of application-machine interactions. This work includes a strong scaling analysis in order to demonstrate the advanced capabilities that can be built upon this compression technique.
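A quick back-of-envelope check, using only the figures quoted in the abstract and assuming a ratio of 100,000x (the low end of "hundreds of thousands"), shows how compression brings per-core trace volume down to emailable sizes:

```python
# Illustrative arithmetic only; both figures come from the abstract.
raw_rate_tb_per_hour = 2.6   # per-core address stream growth rate
ratio = 100_000              # assumed lower bound of the compression ratio

# 2.6 TB/hour = 2.6e12 bytes/hour; divide by the ratio, express in MB.
compressed_mb_per_hour = raw_rate_tb_per_hour * 1e12 / ratio / 1e6
print(compressed_mb_per_hour)  # roughly 26 MB/hour per core
```

At that rate, even a multi-hour per-core trace fits in a file of tens of megabytes, consistent with the claim that compressed profiles can be exchanged between collaborators directly.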