Wide-area systems are gaining in popularity as an infrastructure for
running scientific applications. From a fault tolerance perspective, these
environments are challenging due to their scale and their inherent variability.
Causal message logging protocols have attractive properties that make them
suitable for these environments. They spread fault tolerance information
throughout the system, providing high availability. This information can also be used to
replicate objects that are otherwise inaccessible due to network partitions.
However, current causal message logging protocols do not scale to thousands or
millions of processes. We describe the Hierarchical Causal Logging Protocol
(HCML), which uses a hierarchy of shared logging sites, or proxies, to
reduce the space requirements exponentially. These proxies also act as caches
for fault tolerance information and reduce the overall message overhead of
causal message logging protocols by as much as 50%. In addition, HCML
leverages differences in bandwidth between communicating processes by
piggybacking more fault tolerance information over high-bandwidth links. Doing
so improves overall message latency by as much as 97%.
Pre-2018 CSE ID: CS2001-0671