Dogga, Pradeep

Towards Cloud-Scale Debugging

2024

Abstract

Cloud computing is an integral part of today's world: it primarily enables individuals and enterprises to provision and manage resources such as compute and storage for their needs with the click of a button. A modular approach to software development has enabled cloud providers to rapidly evolve and deliver an increasing number of services to users, rendering clouds mission-critical. To ensure prompt serviceability of this Achilles’ heel when incidents occur, cloud providers employ significant human resources. However, with the ever-increasing number of services offered by clouds and growing types of workloads, such as the recent proliferation of Machine Learning workloads, it is no longer viable for cloud providers to scale their human resources at this pace to ensure prompt serviceability of their clouds.

In this dissertation, I present my work towards improving the serviceability of clouds by leveraging insights from my experience with real debugging workflows employed at the three largest clouds today. I present techniques from Machine Learning and Natural Language Processing that leverage the vast amount of historical debugging data in clouds to develop tools that assist their engineers. I present a 'Coarsening' framework that enables a transition towards a centralized debugging plane and discuss practical evaluations of tools built using this framework.

I present Revelio, a tool that can generate debugging queries for engineers to execute over system-wide logged data, whose results are likely to hint at the root cause of an incident. To enable benchmarking of many techniques, I also built a distributed systems debugging testbed that can inject faults into services, interface with human users, and collect execution logs across the system. I present AutoARTS, a tool that can tag a lengthy postmortem report of an incident in the cloud with all root causes from an extensive taxonomy and can also highlight key pieces of information from a postmortem for ease of analysis. I present PerfRCA, a tool that can scale causal discovery to production-scale telemetry to reason about performance degradations. I conclude with my vision for a centralized approach to automatically extract generalizable debugging assistance for engineers across a cloud.

Main Content


UCLA

Towards Cloud-Scale Debugging