Automated Scalable Management of Data Center Networks
- Author(s): Niranjan Mysore, Radhika
- et al.
Data centers today are growing in size and becoming harder to manage. It is more important than ever to concentrate on the management of such large networks and to arrive at simple yet efficient designs that require minimal manual intervention. Reducing network management costs can improve service availability and response times and increase return on investment. In this dissertation, we focus on three aspects of data center network management: the network fabric, policy enforcement, and fault localization. Each of these areas poses inherent challenges due to scale. First, simple, plug-and-play networks are known not to scale, which often leads network operators to stitch together complex interior and exterior gateway protocols to connect large data centers. Second, network isolation policies can become too large for network hardware to handle as the number of applications multiplexed on a single data center increases. Third, diagnosis can become extremely hard because of the sheer number of components that must interact for a service to succeed. Localizing the fault is often left to knowledgeable operators who work together in war rooms to track down and fight problems. Such an approach can be time-consuming and tedious, and can reduce availability. To address these challenges, we propose to compose the data center management system from three contributions:
(i) PortLand: A scalable layer 2 network fabric that completely eliminates loops and broadcast storms and combines the best elements of traditional layer 2 and layer 3 network fabrics: plug-and-play operation, support for scale, mobility, and path diversity.
(ii) FasTrak: A policy enforcement system that moves network isolation rules between server software and network hardware so that performance-sensitive traffic is not subject to unnecessary overhead and latency. FasTrak enables performance-sensitive applications to move into multi-tenant clouds and supports their requirements.
(iii) Gestalt: A fault localization algorithm, developed from first principles, that can operate in large-scale networks and beats existing localization algorithms in localization accuracy, localization time, or both.
We have prototyped and evaluated each of these systems and believe that they can be easily implemented with minor modifications to data center switches and end hosts.