As data volumes grow across applications, the analysis of large amounts of data is becoming increasingly important. Big data processing frameworks such as Apache Hadoop, Apache AsterixDB, and Apache Spark have been built to meet this demand. A common objective of these traditional cluster-based frameworks is high performance, which often means low end-to-end execution time or latency.
A typical user of these frameworks submits a job and waits minutes, hours, or even days for the results, depending on the size of the input data and the complexity of the job. Users often need to interact with an executing job to check its state or modify parts of it. Traditional big data processing frameworks offer little insight into an executing job. They provide only simple statistics, such as the amount of data input to and processed by the various operators of a job, which may not be enough information for the user.
The widespread adoption of data analytics has led to calls to improve the traditional ways of big data processing. In particular, there have been demands to make the analytics process more interactive and adaptive, especially for long-running jobs. A typical data analytics workflow undergoes multiple iterations of refinement before it becomes the final workflow that performs a task correctly. During these iterations, a data analyst is more interested in seeing the first few results quickly than in the total execution time. If the results are undesirable, the analyst can terminate the workflow without waiting for it to finish. This underlines the importance of initial results in the iterative process of data wrangling and motivates a result-aware approach to big data analytics.
This dissertation is motivated by these calls for improvement in data processing and by experience gained over the past few years of working on the Texera project, a collaborative data analytics service being developed at UC Irvine. Texera is a GUI-based service that allows users to drag and drop operators to create workflows that can be executed on computing clusters. The dissertation consists of three main parts. The first part presents the design of the Amber engine, which serves as the backend data processing framework for the Texera service. Amber supports interactivity and adaptivity during data analysis. A key feature of Amber is its fast control messages, which allow interaction and adaptation to happen with sub-second latency. The second part presents Reshape, an adaptive and result-aware skew-handling framework. Reshape uses fast control messages to implement iterative skew mitigation techniques for a wide variety of operators. The mitigation techniques in Reshape are also analyzed in terms of their effects on the results shown to the user. In addition, Reshape can self-tune its threshold parameter to lessen the technical burden on users. The last part presents Maestro, a result-aware workflow scheduling framework. It addresses how to schedule a workflow for execution on computing clusters and how to make result-aware decisions while doing so. Overall, this work improves the data analytics process by bringing interactivity, adaptivity, and result-awareness to it.