Big Data systems such as Hadoop and Spark enable companies and individuals to run enormous data-processing workloads that would otherwise be impossible to complete. Many thousands of applications run atop each of these systems, making it all the more important that the systems be both performant and maintainable.
The development of these systems is no small feat. They handle concerns such as cluster management and load balancing so that their users can focus on the data-processing problems instead. While this is good for the end user, it means that these systems’ codebases are highly complex, and their developers often make concessions (such as choosing a managed programming language) to mitigate this complexity.
This dissertation aims to improve these systems through the idea of optimistic optimizations: we aggressively apply optimizations that may occasionally be unsound or cause slowdowns, as long as they are beneficial overall. We apply this idea in three works, each improving a different aspect of Big Data systems.
First, we present a Java-based compiler and runtime system named Gerenuk, which transforms a Big Data program to use inlined native bytes, rather than objects, achieving end-to-end speedups and improved memory usage. Although this transformation is not sound in the general case, we show that, given how objects typically behave in this context, most objects benefit from it, and we present a recovery technique for the objects that behave differently.
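To make the transformation concrete, the following is a minimal sketch, not Gerenuk’s actual mechanism or API, of the difference between an object-based record and an inlined native-byte representation; all names and layouts here are illustrative assumptions.

```java
import java.nio.ByteBuffer;

// Object form: every record is a separate heap object with its own header,
// which the garbage collector must track individually.
final class PointRecord {
    long id;
    double x, y;
}

// Inlined native-byte form: records are packed into one flat buffer and
// accessed by offset, eliminating per-record headers and GC pressure.
final class PointBuffer {
    static final int RECORD_SIZE = Long.BYTES + 2 * Double.BYTES; // 24 bytes

    private final ByteBuffer buf;

    PointBuffer(int capacity) {
        this.buf = ByteBuffer.allocateDirect(capacity * RECORD_SIZE);
    }

    void put(int i, long id, double x, double y) {
        int base = i * RECORD_SIZE;
        buf.putLong(base, id);
        buf.putDouble(base + Long.BYTES, x);
        buf.putDouble(base + Long.BYTES + Double.BYTES, y);
    }

    long id(int i)   { return buf.getLong(i * RECORD_SIZE); }
    double x(int i)  { return buf.getDouble(i * RECORD_SIZE + Long.BYTES); }
    double y(int i)  { return buf.getDouble(i * RECORD_SIZE + Long.BYTES + Double.BYTES); }
}
```

The trade-off is that records accessed through such a buffer no longer behave as full objects (they lose identity, virtual dispatch, and so on), which is why a recovery path is needed for the objects that fall outside the typical pattern.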
Second, we present a learning technique that predicts applications’ profiling data, enabling profile-guided optimization across a Big Data cluster without profiling it in its entirety. Given sufficiently high model accuracy, we can predict how objects typically behave from their allocation sites; when the model’s confidence is low, a system can choose not to perform the optimization.
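The sketch below illustrates one way such confidence-gated prediction could be wired into an optimizer; the model interface, feature set, and threshold are illustrative assumptions rather than the dissertation’s actual design.

```java
// Features describing an allocation site; which features matter is an
// assumption made for illustration.
record AllocationSiteFeatures(String enclosingMethod, String allocatedType,
                              int loopDepth) {}

// A trained model that estimates how likely objects from a site are to
// follow the typical behavior the optimization relies on.
interface SiteModel {
    double predictOptimizable(AllocationSiteFeatures site); // in [0, 1]
}

final class OptimizationGate {
    private final SiteModel model;
    private final double threshold; // e.g., 0.95: optimize only when confident

    OptimizationGate(SiteModel model, double threshold) {
        this.model = model;
        this.threshold = threshold;
    }

    boolean shouldOptimize(AllocationSiteFeatures site) {
        // Below the threshold we keep the unoptimized path rather than risk
        // an unsound transformation or a slowdown.
        return model.predictOptimizable(site) >= threshold;
    }
}
```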
Finally, we present a learning technique to predict context-sensitive points-to information from context-insensitive information. These predictions are much faster to obtain than computed context-sensitive results, and can help scale such analyses to the size of these systems’ codebases. Although individual predictions may be wrong, depending on a user’s needs we can fall back to the context-insensitive solution, run the real (slow) analysis, or tune the model to the required precision and recall.
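As a minimal sketch of the fallback policy just described, assuming hypothetical names and thresholds not taken from the dissertation:

```java
// The three ways a client can obtain points-to information, trading speed
// against precision.
enum Answer { PREDICTED, INSENSITIVE, FULL_ANALYSIS }

final class PointsToPolicy {
    private final double minConfidence; // tuned to the required precision/recall

    PointsToPolicy(double minConfidence) {
        this.minConfidence = minConfidence;
    }

    Answer resolve(double modelConfidence, boolean mustBeSound) {
        if (mustBeSound) {
            // Clients that cannot tolerate wrong points-to edges run the
            // real (slow) context-sensitive analysis.
            return Answer.FULL_ANALYSIS;
        }
        if (modelConfidence >= minConfidence) {
            return Answer.PREDICTED;   // fast, possibly imprecise
        }
        return Answer.INSENSITIVE;     // sound but context-insensitive
    }
}
```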