Large-scale data analytical applications such as social network analysis and web analysis have revolutionized modern computing. The processing demand posed by an unprecedented amount of data challenges both industrial practitioners and academia researchers to design and implement highly efficient and scalable system infrastructures. However, Big Data processing is fundamentally limited by inefficiencies inherent with the underlying programming languages.
While offering several invaluable benefits, a managed runtime comes with time and space overheads. In large-scale systems, the runtime system cost can be easily magnified and become the critical performance bottleneck. Our experience with dozens of real-world systems reveals the root cause is the mismatch between the fundamental assumptions based on which the current runtime is designed and the characteristics of modern data-intensive workloads.
This dissertation consists of a series of techniques, spanning programming model, compiler, and runtime system, that can efficiently mitigate the mismatches in real-world systems, and hence, significantly improve the efficiency of various aspects of Big Data processing. Specifically, this dissertation makes the following three contributions. The first contribution is the development of a framework named Facade aiming to reduce the cost of object-based representation without an intrusive modification of a JVM. Facade uses a compiler to generate highly efficient data manipulation code by automatically transforming classes in such a way that objects are created only as proxies. Facade advocates for the separation of data storage and data manipulation. The execution model enforces a statically-bounded total number of data objects in an application regardless of how much data it processes. The second contribution is the design and implementation of Yak, the first hybrid garbage collector tailored for Big Data systems. Yak provides high throughput and low latency for all JVM-based languages by adapting its algorithms to two vastly different types of object lifetime behaviors in Big Data applications. Finally, the third contribution is a JVM-based alternative to enable instantaneous data transfer across nodes in clusters called Skyway. Skyway optimizes away inefficiencies of relying on reflection – a heavy runtime operation, and handcrafted procedures in converting data format by transferring objects as-is, without changing much of their existing format.
We have extensively evaluated those compiler and runtime techniques in several real-world, widely-deployed systems. The results show significant improvement of the system over the baseline: faster execution, reduced memory management costs, and improved scalability. The techniques are also highly practical and easy to integrate without much user efforts, making the adoption in real setting possible. The work has inspired a line of several follow-up work from academia. Moreover, the Yak system has been adopted by a telecommunication company.