Challenges, Opportunities, and Solutions for Next Generation Data Analytics Systems
- Author(s): Thomas, Shelby
- Advisor(s): Porter, George
- et al.
Data on the internet is increasing at an exponential rate. The arms race to build on and capitalize on this information has led to ever more raw horsepower being thrown at the problem. Datacenters have grown in both size and number, hardware has been built that promises a 10x improvement over the previous generation’s model, and software frameworks have been amended and repurposed rather than redesigned. This has led to diminishing returns on investment when it comes to performance: large improvements to just one part of a system often do not translate directly to others. The result is wasted cores, cycles, and costs. In this thesis, I look at new ways to design for the data problem holistically. I analyze current trends in CPUs, network interconnects, and software frameworks and find opportunities where existing systems can improve by working in tandem.
In the case of network speeds, I surface nuances in the interaction between the network card, memory, and the application. I present a new insight, Dark Packets, which shows that the outsized gains in network speeds cannot be exploited without improvements to DRAM bandwidth or increased core counts. In the case of software frameworks, I find that today’s kernel network stacks are not suited to the fan-out architecture found in modern data analytics frameworks.
Finally, I take the insights learned from these studies and build a new serverless burst-parallel data analytics framework, SAL, that is portable, general purpose, and performant. The goal of this work is to demonstrate that holistic, full-stack approaches to building data analytics systems result in lower cost and higher performance.