- Main
Time-Sharing Redux for Large-Scale HPC Systems
Published Web Location
https://doi.org/10.1109/hpcc-smartcity-dss.2016.0051Abstract
HPC facilities typically use batch scheduling to space-share jobs. In this paper we revisit time-sharing using a trace of over 2.4 million jobs obtained during 20 months of operation of a modern petascale supercomputer. Our simulations show that batch scheduling produces skewed distributions with much larger slowdowns for shorter-running, larger jobs, whereas time-sharing produces more uniform slowdowns. Consequently, for applications that strong scale, the turnaround time does not scale with batch scheduling, but it does with time-sharing, resulting in turnarounds that are orders of magnitude better at the largest scales. We also show that time-sharing can confer additional benefits in noisy systems and with modern programming practices. Future Exascale HPC systems are expected to exhibit billion-way heterogeneous parallelism and poor performance predictability. As many applications will run in strong scaling, how resource allocation policies affect the experience of supercomputer users has once again become a timely subject.
Main Content
Enter the password to open this PDF file:
-
-
-
-
-
-
-
-
-
-
-
-
-
-