Coping with dependent failures in distributed systems
- Author(s): Junqueira, Flavio
Traditionally, fault-tolerant systems assume that failures are independent, an assumption often expressed as a threshold on the number of failures. This assumption, however, is increasingly unrealistic for current distributed systems. This dissertation presents a model of dependent failures based on two abstractions: cores and survivor sets. Cores are minimal sets of processes such that at least one process in the set is correct in every execution. Survivor sets are minimal sets of processes such that in every execution at least one survivor set contains only correct processes. This model has both theoretical and practical uses. Theoretically, it enables more flexible designs of distributed algorithms, and this flexibility often improves efficiency; for example, it enables systems to meet reliability goals with fewer processes. Practically, I apply the model to replication techniques for multi-site systems that tolerate site failures, and to cooperative backup systems that tolerate Internet attacks.

Multi-site systems, such as grid systems, experience failures that render all the processes in a site unavailable, often caused by the failure of shared resources. Under such site failures, previous availability results on replication techniques, such as quorum systems, no longer necessarily hold. Using a failure model that explicitly captures dependent failures, I develop techniques for selecting replicas and forming quorums that achieve optimal availability in multi-site systems.

Internet attacks by worms and viruses can infect a large number of hosts, resulting in catastrophic failures. To cope with such attacks, I propose a new approach to designing distributed systems called informed replication, which uses a model of correlated failures to exploit software diversity. To demonstrate this approach, I design and evaluate a cooperative backup service called the Phoenix Recovery System.
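As an aside, the duality between cores and survivor sets can be illustrated with a short sketch. Assuming failures are described by an explicit list of maximal failure scenarios (sets of processes that can fail together), survivor sets are the minimal complements of those scenarios, and cores are the minimal sets that intersect every survivor set. The process names and the brute-force enumeration below are my own illustration, not the dissertation's algorithms.

```python
from itertools import chain, combinations

def minimal(sets):
    """Keep only the inclusion-minimal sets in a collection."""
    sets = set(map(frozenset, sets))
    return {s for s in sets if not any(t < s for t in sets)}

def survivor_sets(processes, failure_scenarios):
    """Minimal complements of the maximal failure scenarios:
    in every execution, some survivor set is entirely correct."""
    return minimal(frozenset(processes) - frozenset(f)
                   for f in failure_scenarios)

def cores(processes, failure_scenarios):
    """Minimal sets intersecting every survivor set: such a set
    contains at least one correct process in every execution."""
    ss = survivor_sets(processes, failure_scenarios)
    candidates = chain.from_iterable(
        combinations(processes, r) for r in range(1, len(processes) + 1))
    return minimal(frozenset(c) for c in candidates
                   if all(frozenset(c) & s for s in ss))

# Two sites of two processes each; an entire site can fail at once.
procs = ["a1", "a2", "b1", "b2"]
scenarios = [{"a1", "a2"}, {"b1", "b2"}]
print(cores(procs, scenarios))
```

Here every core pairs one process from each site, e.g. {a1, b1}: size 2, even though two processes can fail together. A threshold model tolerating 2 failures among 4 processes would instead need cores of size 3, which is one way the dependent-failure model yields smaller process sets.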
Phoenix uses heuristics to select replicas that provide strong reliability guarantees, require a low degree of replication, limit the storage burden on each host in the system, and lend themselves to a distributed implementation. Incorporating dependent failures into the design of systems and algorithms therefore yields important advantages: more efficient algorithms, higher availability, and more efficient utilization of resources. As systems grow in size and extent, these advantages will only become more significant.
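To make the multi-site quorum idea concrete, here is a minimal sketch of a classic majority-of-majorities construction (my own illustration under simplifying assumptions, not the dissertation's optimal construction): a quorum takes a majority of processes from each site in a majority of the sites. Any two such quorums share a site, and within that site share a process, so the intersection property that quorum protocols rely on survives the failure of a minority of whole sites.

```python
from itertools import combinations, product

def majorities(items):
    """All smallest majority subsets of a collection."""
    items = list(items)
    need = len(items) // 2 + 1
    return [frozenset(c) for c in combinations(items, need)]

def multi_site_quorums(sites):
    """A quorum = a majority of processes from each site
    in a majority of the sites."""
    quorums = []
    for chosen in majorities(range(len(sites))):
        per_site = [majorities(sites[i]) for i in chosen]
        for pick in product(*per_site):
            quorums.append(frozenset().union(*pick))
    return quorums

# Three sites of three processes each.
sites = [["a1", "a2", "a3"], ["b1", "b2", "b3"], ["c1", "c2", "c3"]]
quorums = multi_site_quorums(sites)
# Every pair of quorums intersects, even if a whole site fails.
assert all(q1 & q2 for q1, q2 in combinations(quorums, 2))
```

Each quorum here has 4 of the 9 processes, smaller than a flat majority of 5, which hints at why site-aware quorum constructions can improve on site-oblivious ones.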