This paper presents a new checkpointing algorithm for systems using reliable communication channels. The new algorithm requires O(n + m) communication messages, where n is the number of participating processes, and m is the number of late messages. The algorithm is non-blocking, requires minimal message logging, and has minimal stable storage requirements. This algorithm is also scalable, simple, transparent to the user, and facilitates fast recovery. By introducing sutable delay in the checkpointing process, the parameter m can be made small. We also describe a variant of the algorithm that requires only O(n) messages, at a cost of O(n) additional storage for each process.
This paper also presents an efficient coordination mechanism, called the Process Order. The Process Order mechanism can be used for grouping processes in arbitrary structures in order to solve various problems, including scalability, failure detection, and coordinator election. The Process Order merchanism groups the processes transparent to the user, and automatically adjusts to the changes in system topology.