eScholarship
Open Access Publications from the University of California

Time-Sharing Redux for Large-Scale HPC Systems

Author(s): Hofmeyr, S; Iancu, C; Colmenares, J; Roman, E; Austin, B; et al.

Abstract

© 2016 IEEE. HPC facilities typically use batch scheduling to space-share jobs. In this paper we revisit time-sharing using a trace of over 2.4 million jobs obtained during 20 months of operation of a modern petascale supercomputer. Our simulations show that batch scheduling produces skewed slowdown distributions, with much larger slowdowns for shorter-running, larger jobs, whereas time-sharing produces more uniform slowdowns. Consequently, for applications that strong-scale, turnaround time does not scale under batch scheduling but does under time-sharing, resulting in turnarounds that are orders of magnitude better at the largest scales. We also show that time-sharing can confer additional benefits in noisy systems and with modern programming practices. Future Exascale HPC systems are expected to exhibit billion-way heterogeneous parallelism and poor performance predictability. Because many applications will run in a strong-scaling regime, how resource allocation policies affect the experience of supercomputer users has once again become a timely subject.
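
The contrast in slowdowns described in the abstract can be illustrated with a toy calculation. The sketch below is not the paper's simulator and uses no trace data. It assumes a single shared resource, all jobs submitted at time zero, no context-switch overhead, and slowdown defined as turnaround time divided by runtime (a common definition, assumed here because the abstract does not state one). The function names and the three example runtimes are hypothetical.

```python
def fcfs_slowdowns(runtimes):
    """Slowdown per job when jobs run one at a time in submission order.

    Assumption: slowdown = turnaround / runtime, all jobs arrive at t = 0.
    """
    clock = 0.0
    slowdowns = []
    for runtime in runtimes:
        clock += runtime                 # job finishes after everything ahead of it
        slowdowns.append(clock / runtime)
    return slowdowns


def time_shared_slowdowns(runtimes):
    """Slowdown per job when every active job gets an equal share of the machine.

    Idealized processor sharing: with k active jobs, each progresses at rate 1/k.
    """
    order = sorted(range(len(runtimes)), key=lambda i: runtimes[i])
    slowdowns = [0.0] * len(runtimes)
    clock = 0.0
    served = 0.0                         # work each still-active job has received
    active = len(runtimes)
    for i in order:                      # jobs finish in order of increasing runtime
        clock += (runtimes[i] - served) * active
        served = runtimes[i]
        slowdowns[i] = clock / runtimes[i]
        active -= 1
    return slowdowns


if __name__ == "__main__":
    jobs = [100.0, 1.0, 10.0]            # hypothetical runtimes, in submission order
    print("batch (FCFS):", fcfs_slowdowns(jobs))
    print("time-shared: ", time_shared_slowdowns(jobs))
```

In this toy case the short job queued behind the long one sees a slowdown of about 101 under first-come-first-served ordering, while under equal sharing no job is slowed by more than a factor of 3. This only mirrors the qualitative claim about skewed versus uniform slowdowns; it says nothing about the magnitudes reported in the paper.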
