- Main
Providing Efficient Fault Tolerance in Distributed Systems
- Zhuang, Siyuan
- Advisor(s): Stoica, Ion;
- Song, Dawn
Abstract
The exponential growth in data and computational demands is transforming the approach to system design, particularly in tackling large-scale problems such as training large language models. This shift necessitates the widespread adoption of distributed systems. Simultaneously, systems and applications are becoming increasingly heterogeneous and sophisticated. In this evolving landscape, a critical challenge arises: supporting a wide range of distributed applications while simultaneously achieving computational efficiency and fault tolerance.This thesis explores the development of universal distributed systems that provide efficient fault tolerance for modern applications. The key idea is to exploit the semantics of workloads at all layers of distributed systems. At the communication layer, we introduce Hoplite, a distributed object store that dynamically exploits data transfer patterns and employs fine-grained pipelining to gain efficiency. Hoplite also reschedules tasks to mitigate the effects of failures. At the task execution layer, ExoFlow leverages the semantics of tasks and data passing between tasks to separate execution and recovery units within workflow systems. This approach ensures exactly-once failure recovery semantics while minimizing checkpointing overhead. Together, these contributions demonstrate a full-stack approach to building universal, efficient, and fault-tolerant distributed systems.
Main Content
Enter the password to open this PDF file:
-
-
-
-
-
-
-
-
-
-
-
-
-
-