Checkpoint/Restart Vision and Strategies for NERSC’s Production Workloads
eScholarship
Open Access Publications from the University of California

Published Web Location

https://doi.org/10.2172/1814161
Abstract

As a primary approach to fault-tolerant computing, Checkpoint/Restart (C/R) improves scientific productivity for users, provides scheduling flexibility for computing centers, and protects against system failures. While both application-specific (or application-level) and transparent C/R are used in practice, we are interested in transparent checkpointing, which is vital for system-level checkpointing. Developing and maintaining transparent C/R tools for HPC applications, however, is labor-intensive and highly complex due to ever-changing HPC systems and diverse production workloads. Existing C/R tools are often research-oriented, so there is a gap to close before they can be used reliably with production workloads, especially on cutting-edge HPC systems. In this position paper, we present our journey to prepare a production-ready MPI-Agnostic Network-Agnostic (MANA) transparent checkpointing tool for NERSC, and share our vision and strategies to bring transparent C/R capabilities to NERSC’s production workloads on current and future systems.
