eScholarship
Open Access Publications from the University of California

The design and implementation of Berkeley Lab's Linux checkpoint/restart

Abstract

This paper describes Berkeley Linux Checkpoint/Restart (BLCR), a Linux kernel module that allows system-level checkpoints on a variety of Linux systems. BLCR can be used either as a standalone system for checkpointing applications on a single machine, or as a component by a scheduling system or parallel communication library for checkpointing and restoring parallel jobs running on multiple machines. Integration with the Message Passing Interface (MPI) and other parallel systems is described.
