As a primary approach to fault-tolerant computing, Checkpoint/Restart (C/R) improves scientific productivity for
users, provides scheduling flexibility for computing centers, and protects against system failures. While both applicationspecific (or application-level) and transparent C/R are used in practice, we are interested in transparent checkpointing, which is vital for system-level checkpointing. Developing and maintaining transparent C/R tools for HPC applications, however, is labor intensive and highly complex due to ever-changing HPC systems
and diverse production workloads. Existing C/R tools are often research-oriented, so there is a gap to close before they can be used reliably with production workloads, especially on cutting edge HPC systems. In this position paper, we present our journey to prepare a production-ready MPI-Agnostic Network-Agnostic (MANA) transparent checkpointing tool for NERSC, and share our vision and strategies to bring transparent C/R capabilities to
NERSC’s production workloads on current and future systems.