Lee, Cynthia Bailey

On the user-scheduler relationship in high-performance computing

2009

Abstract

To effectively manage High-Performance Computing (HPC) resources, it is essential to maximize return on the substantial infrastructure investment they entail. One prerequisite to success is the ability of the scheduler and user to interact productively. This work develops criteria for measuring productivity, analyzes several aspects of the user-scheduler relationship via user studies, and develops solutions to some vexing barriers between users and schedulers. The five main contributions of this work are as follows. First, this work quantifies the desires of the user population and represents them as a utility function. This contribution includes a survey-based study collecting utility data from users of a supercomputer system, augmentation of the Standard Workload Format to enable scheduler research using utility functions, and a model for synthetically generating utility-function-augmented workloads. Second, a number of the classic scheduling disciplines are evaluated by their ability to maximize the aggregate utility of all users, using the synthetic utility functions. These evaluations show the performance impact of inaccurate runtime estimates, contradicting an oft-quoted prior result [55] that inaccuracy of estimates leads to better scheduling. Third, a scheduler that optimizes the aggregate utility of all users, using a genetic algorithm heuristic, is demonstrated. This contribution includes two software artifacts: an implementation of the genetic algorithm (GA) scheduler, and a modular, extensible scheduler simulation framework that simulates several classic scheduling disciplines and is interoperable with the Standard Workload Format. Fourth, the ability of users to interact productively with this scheduler by providing an accurate estimate of their resource (run time) needs is examined.
This contribution consists of formalizing a frequent casual assertion from the scheduling literature, that users typically "pad" runtime estimates, into an explicit Padding Hypothesis, and then falsifying the hypothesis via a survey-based study of users of a supercomputer system. Specifically, absent an incentive to pad (and including incentives to be accurate), the inaccuracy of runtime estimates improved only from an average of 61% inaccurate to an average of 57% inaccurate. This contribution has implications not only for the proposed genetic algorithm scheduler, but for any scheduler that asks users for an estimate, which currently includes virtually all parallel job schedulers, both in production use and proposed in the literature. Fifth, a survey of users of a supercomputer system and associated simulations explore the feasibility of removing one of the defining constraints of the parallel job scheduling problem: the non-preemptability of running jobs. An investigation of users' current checkpointing habits produced a workload labeled with per-job checkpoint information, enabling simulation of a checkpoint-aware GA scheduler that may preempt running jobs as it optimizes aggregate utility. Lifting the non-preemptability constraint improves performance of the GA scheduler by 16% (and by 23% compared to the classic EASY algorithm), including overhead penalties for job termination and restart.

UC San Diego
