The proliferation of distributed internet services has reaffirmed the need for reliable, high-performance networks, not only in the WAN that brings users to the services, but also within the datacenters where the services themselves reside. Services consist of distributed applications running across thousands of servers within datacenters, with stringent performance, scaling, and reliability requirements. To support these requirements, datacenter networks comprise thousands of servers, links, and ports, along with hundreds of switches that provide multiple paths between any pair of servers. Because any given component has a small but non-zero failure rate, the large number of components means that failures are endemic inside datacenters. Unfortunately, not all failures are easily diagnosable within datacenter environments.
In particular, datacenters are susceptible to insidious, parasitic performance loss caused by a class of network component faults known as partial faults, in which a component is nominally healthy but intermittently drops or delays traffic. These faults have been noted as particularly difficult to detect and localize, though mitigation can be straightforward once the faulty component is identified. Pinpointing partial faults quickly is crucial because they can inflict a disproportionately high toll on application performance.
Partial faults can confound existing fault detection methods in several ways, including interactions among the fault itself, application traffic characteristics, and networking hardware. For example, network switches may fail to detect a fault because their monitoring capabilities are unreliable or insufficiently sensitive. Traffic volume and variability may complicate analysis of server-based application and network metrics, and may also mask the impact of a fault. Moreover, the myriad paths available to network flows complicate localization even when servers do detect partial faults.
However, this work shows that the scale and regular design of contemporary datacenters can simplify partial-fault localization. In particular, the combination of large-scale load-balanced multipath topologies and high-volume datacenter traffic enables simple, low-overhead, application-agnostic, and root-cause-agnostic partial-fault localization via passive, link-by-link outlier analysis of application network performance. I validate the effectiveness of my approach within large-scale first-party production datacenters, and examine the additional challenges and complexities raised by third-party cloud datacenters.
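As a rough illustration of what passive, link-by-link outlier analysis could look like, the following Python sketch aggregates per-flow retransmission counts onto the links each flow traversed and flags links whose aggregate retransmission rate is a robust outlier among their peers. It is a minimal sketch under stated assumptions, not the method evaluated in this work: the flow record layout, the function name `localize_outlier_links`, the use of retransmission rate as the performance metric, the median/MAD test, and the threshold values are all hypothetical. It also assumes the links being compared are topologically equivalent peers (for example, one tier of a load-balanced fabric), so that healthy links see statistically similar traffic.

```python
from collections import defaultdict
from statistics import median


def localize_outlier_links(flows, threshold=5.0, min_packets=1000):
    """Flag links whose retransmission rate is an outlier among peer links.

    flows: iterable of (links, packets, retransmits) tuples, where `links`
           is the list of link IDs the flow traversed (hypothetical layout).
    threshold: number of MADs above the median that counts as an outlier
           (assumed value, not taken from the source).
    min_packets: ignore links with too little traffic to judge.
    """
    sent = defaultdict(int)
    retx = defaultdict(int)

    # Passive aggregation: attribute each flow's counters to every link on its path.
    for links, packets, retransmits in flows:
        for link in links:
            sent[link] += packets
            retx[link] += retransmits

    # Per-link retransmission rate, restricted to links with enough traffic.
    rates = {l: retx[l] / sent[l] for l in sent if sent[l] >= min_packets}
    if len(rates) < 3:
        return []  # too few peers to define "normal"

    # Robust center and spread: median and median absolute deviation (MAD).
    med = median(rates.values())
    mad = median(abs(r - med) for r in rates.values())
    scale = max(mad, 1e-9)  # guard against division by zero when peers are identical

    # Only links that are worse than their peers are of interest.
    return sorted(l for l, r in rates.items() if (r - med) / scale > threshold)


if __name__ == "__main__":
    # Six peer links; "core-5" carries far more retransmissions than the rest.
    example = [([f"core-{i}"], 100_000, n)
               for i, n in enumerate([10, 12, 9, 11, 10, 900])]
    print(localize_outlier_links(example))
```

The median and MAD are used here only because they are not skewed by the faulty link itself; any outlier test that tolerates a small number of bad peers would serve the same illustrative purpose.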