Fault localization in backbone networks
- Author(s): Kompella, Ramana Rao
- et al.
Automated, rapid, and effective fault management is a central goal of large operational IP networks. Yet today's networks suffer from a wide and volatile set of failure modes in which the underlying fault is difficult to detect and localize. In this dissertation, we introduce a fault-localization methodology based on risk models. At a high level, risk modeling involves constructing a bipartite dependency relationship between a set of observable failure symptoms and their associated root causes. Novel fault-localization algorithms then take the set of observed failure symptoms and the constructed risk models and output a set of candidate root causes that best explain the symptoms. Using observations from monitoring data commonly available today in ISP networks, we apply the risk-modeling methodology to two fault-localization problems commonly observed in practice: IP link localization and black hole localization. For these two problems, we have designed, implemented, evaluated, and deployed systems in a real tier-one ISP network. Our experience indicates that risk modeling significantly narrows the set of candidate root causes of failures, thus helping network operators respond quickly to common failure modes. While our systems indicate tremendous promise in the risk-modeling approach, risk modeling is still an indirect inference mechanism born out of necessity, especially in situations where direct isolation mechanisms do not exist. Thus, we propose a composition-based architecture called m-Plane that uses specialized router primitives to directly isolate the location of failures that affect traffic. In addition to monitoring connectivity problems, our architecture also generalizes to localizing other end-to-end performance degradations such as delay and loss.
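To make the risk-modeling idea concrete, the following is a minimal sketch of how a bipartite risk model might be used to pick candidate root causes. The abstract does not specify the localization algorithms, so this uses a plain greedy set-cover heuristic as a stand-in; the function name `localize`, the dictionary layout, and the example risk names (`fiber_A`, `router_X`, `linecard_Y`) are all hypothetical.

```python
from typing import Dict, List, Set


def localize(risk_model: Dict[str, Set[str]], observed: Set[str]) -> List[str]:
    """Return a small set of root causes that together explain `observed`.

    `risk_model` maps each candidate root cause to the set of symptoms it
    would produce if it failed (one side of the bipartite graph to the other).
    This is a greedy set-cover heuristic, an assumption for illustration,
    not the dissertation's algorithm.
    """
    unexplained = set(observed)
    hypothesis: List[str] = []
    while unexplained:
        # Pick the cause explaining the most still-unexplained symptoms.
        # Iterating in sorted order makes tie-breaking deterministic.
        best = max(sorted(risk_model), key=lambda c: len(risk_model[c] & unexplained))
        covered = risk_model[best] & unexplained
        if not covered:
            break  # remaining symptoms are not explained by any modeled cause
        hypothesis.append(best)
        unexplained -= covered
    return hypothesis


# Hypothetical risk model: a shared fiber span carries three IP links,
# a router terminates two links, and a line card serves one link.
risk_model = {
    "fiber_A": {"link1", "link2", "link3"},
    "router_X": {"link3", "link4"},
    "linecard_Y": {"link4"},
}

single = localize(risk_model, {"link1", "link2", "link3"})
double = localize(risk_model, {"link1", "link2", "link3", "link4"})
print(single)
print(double)
```

In this toy model, three simultaneous link failures are explained by the single shared fiber rather than by three independent faults, which is the sense in which a risk model narrows the set of root causes an operator has to inspect.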