Scalable, efficient, and fault-tolerant data center networking
- Author(s): Walraed-Sullivan, Meg
The advent of cloud computing and the expectation of anytime availability of user data and services have brought data center design to the forefront of computer science research. Modern data centers can be massive in size, consisting of hundreds of thousands of servers and millions of virtualized end hosts. At this scale and complexity, the underlying network becomes central to data center scalability, efficiency, availability and fault tolerance. Given the scale of today's data center networks, operators typically turn to symmetric, highly structured network topologies, sacrificing flexibility for relative simplicity. These topologies tend to have an "all or nothing" tradeoff between fault tolerance and scalability. Over these topologies, data center operators often run protocols borrowed from the Internet, an environment that is drastically different from that of the data center. Because these protocols have not been built for the data center, they can operate and interact in unexpected and undesirable ways. Moreover, they are generally vetted by virtue of having survived in the Internet, rather than by formal reasoning. This makes the management burden associated with configuration, maintenance and error diagnosis for these protocols substantial, leading to compromised efficiency and availability. The first contribution of this dissertation is the introduction of a new class of network topologies called Aspen trees. Aspen trees provide the high throughput and path multiplicity of current data center network topologies while also allowing a network operator to select a particular point on the scalability versus fault tolerance spectrum. This addresses the challenge of supporting simultaneous scalability and fault tolerance in data center networks. Next, the challenge of providing scalable and efficient communication is addressed with the design of ALIAS, a protocol for scalable, automatic and decentralized addressing and communication in the data center.
Finally, this dissertation presents a formalization and proof of correctness of the fundamental building block of ALIAS, thus enabling feasible configuration and maintenance of ALIAS in the data center. This combination of tunable topology structure and tailored communication protocols enables scalable, efficient and fault-tolerant data center communication.