Toward Understanding and Dealing with Failures in Cloud-Scale Systems
In cloud-scale systems, faults are a fact of life. Tolerating faults and providing highly available service is arguably the single most important task for cloud builders. Yet despite considerable effort invested in fault tolerance and in software engineering for reliability, all cloud-scale services continue to experience costly failures. A natural question is: why do cloud-scale services still fail despite abundant fault-tolerance mechanisms, and how can we improve them further? This thesis attempts to shed light on this question.
In the first part of this thesis, we study a set of 34 publicly disclosed cloud service outages that we gathered, and we examine them from the point of view of fault-tolerance mechanisms. We present a novel taxonomy of the reasons these mechanisms may be ineffective; it includes faults that cannot be handled by replication, insufficient redundancy, and undetected faults. We also explore the root causes of the failures and investigate how system components interact in failures caused by multiple faults.
We find that, in many cases, while cloud systems are robust to traditional faults, they are fragile under misconfiguration, which is a major source of service unavailability. To further improve cloud service quality, it is crucial to reduce misconfiguration.
In the second part of this thesis, we propose a framework, ConfValley, to systematically validate configuration and catch errors before they reach production. At the core of ConfValley is CPL, a language that allows experts to express configuration specifications declaratively. To further reduce the burden on operators of writing configuration specifications, our framework also includes a component that infers specifications automatically.
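As a rough illustration of the two ideas above, declarative specifications and specification inference, the following Python sketch separates what a configuration must satisfy from the code that checks it. It is not CPL and does not reflect ConfValley's implementation; the names Spec, validate, and infer_int_range, as well as the configuration keys, are hypothetical and chosen only for this example.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass(frozen=True)
class Spec:
    """A declarative constraint on one configuration key (hypothetical type)."""
    key: str
    description: str
    check: Callable[[Any], bool]


def validate(config: dict[str, Any], specs: list[Spec]) -> list[str]:
    """Return one message per violated spec; an empty list means the config passes."""
    errors = []
    for spec in specs:
        if spec.key not in config:
            errors.append(f"{spec.key}: missing")
        elif not spec.check(config[spec.key]):
            errors.append(f"{spec.key}: violates '{spec.description}'")
    return errors


def infer_int_range(key: str, samples: list[dict[str, Any]]) -> Spec:
    """Naive inference (an assumption of this sketch): treat the range of values
    observed in known-good configurations as the allowed range."""
    values = [sample[key] for sample in samples]
    lo, hi = min(values), max(values)
    return Spec(
        key,
        f"integer in [{lo}, {hi}]",
        lambda v, lo=lo, hi=hi: isinstance(v, int) and lo <= v <= hi,
    )


# Hand-written specifications: the constraints are data, not control flow.
SPECS = [
    Spec("replica_count", "integer >= 3", lambda v: isinstance(v, int) and v >= 3),
    Spec("endpoint", "non-empty string", lambda v: isinstance(v, str) and v != ""),
]
```

Calling validate({"replica_count": 1}, SPECS) reports both the constraint violation and the missing endpoint key before the configuration is deployed. An inferred specification such as infer_int_range("timeout_s", samples) can be appended to SPECS and checked in the same way, although a real inference component would need to guard against overfitting to the observed samples.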
We evaluate ConfValley at a leading cloud service provider, Microsoft Azure, on various types of its configuration data. We rewrite existing configuration validation code in Microsoft Azure in CPL, using more than 10x fewer lines of code. The framework also automatically infers thousands of CPL specifications with high accuracy. With the translated and automatically generated specifications, we prevented a number of configuration errors from rolling out to production in Microsoft Azure.