Lawrence Berkeley National Laboratory
Time-Sharing Redux for Large-Scale HPC Systems
Author(s):
- Hofmeyr, S
- Iancu, C
- Colmenares, J
- Roman, E
- Austin, B
- et al.
Published Web Location: https://doi.org/10.1109/HPCC-SmartCity-DSS.2016.0051
© 2016 IEEE. HPC facilities typically use batch scheduling to space-share jobs. In this paper we revisit time-sharing using a trace of over 2.4 million jobs obtained during 20 months of operation of a modern petascale supercomputer. Our simulations show that batch scheduling produces skewed distributions with much larger slowdowns for shorter-running, larger jobs, whereas time-sharing produces more uniform slowdowns. Consequently, for applications that strong scale, the turnaround time does not scale with batch scheduling, but it does with time-sharing, resulting in turnarounds that are orders of magnitude better at the largest scales. We also show that time-sharing can confer additional benefits in noisy systems and with modern programming practices. Future Exascale HPC systems are expected to exhibit billion-way heterogeneous parallelism and poor performance predictability. As many applications will run in strong scaling, how resource allocation policies affect the experience of supercomputer users has once again become a timely subject.
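The contrast between skewed and uniform slowdowns can be illustrated with a toy model. The sketch below is not the paper's simulator and does not use its trace. It assumes a single shared resource, all jobs arriving at time 0, no context-switch overhead, and the common definition of slowdown as turnaround time divided by runtime. The function names `fcfs_slowdowns` and `time_shared_slowdowns` are hypothetical, and first-come-first-served execution stands in for batch scheduling in a deliberately simplified way.

```python
def fcfs_slowdowns(runtimes):
    """Run jobs back to back in arrival order (a crude stand-in for batch
    scheduling on one resource). Slowdown = turnaround / runtime."""
    clock = 0.0
    slowdowns = []
    for runtime in runtimes:
        clock += runtime                 # job finishes after all earlier jobs
        slowdowns.append(clock / runtime)
    return slowdowns


def time_shared_slowdowns(runtimes):
    """Idealized time-sharing (processor sharing): every unfinished job
    receives an equal share of the resource, with zero switching cost."""
    n = len(runtimes)
    # Under equal sharing with a common start time, jobs finish in order
    # of increasing runtime.
    order = sorted(range(n), key=lambda i: runtimes[i])
    slowdowns = [0.0] * n
    clock = 0.0
    served = 0.0                         # service each active job has received
    for rank, i in enumerate(order):
        active = n - rank                # jobs still sharing the resource
        clock += (runtimes[i] - served) * active
        served = runtimes[i]
        slowdowns[i] = clock / runtimes[i]
    return slowdowns


if __name__ == "__main__":
    jobs = [100.0, 1.0, 10.0]            # made-up runtimes, arrival order
    print([round(s, 2) for s in fcfs_slowdowns(jobs)])         # [1.0, 101.0, 11.1]
    print([round(s, 2) for s in time_shared_slowdowns(jobs)])  # [1.11, 3.0, 2.1]
```

In this made-up example the 1-unit job queued behind the 100-unit job sees a slowdown of 101 when jobs run back to back, but only 3 under idealized sharing, while the long job's slowdown rises only from 1.0 to 1.11. The toy captures the runtime effect only; it says nothing about job size (node count), which the abstract also identifies as a factor.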