Lawrence Berkeley National Laboratory
The design and implementation of Berkeley Lab's Linux checkpoint/restart
- Author(s): Duell, Jason
- et al.
This paper describes Berkeley Linux Checkpoint/Restart (BLCR), a Linux kernel module that allows system-level checkpoints on a variety of Linux systems. BLCR can be used either as a stand-alone system for checkpointing applications on a single machine, or as a component used by a scheduling system or parallel communication library to checkpoint and restore parallel jobs running on multiple machines. The paper also describes integration with the Message Passing Interface (MPI) and other parallel systems.