Domain experts, such as public health scientists and medical researchers, play a crucial role in data analytics because their domain knowledge can unlock valuable insights from data. However, current data analytics tools present them with several challenges. They often lack the technical skills needed to analyze large datasets and must therefore collaborate with technical experts, who may in turn lack the relevant domain knowledge. Moreover, processing large volumes of data can take a long time, and non-technical users are left without any feedback while they wait.
Over the past six years, our team has been developing Texera, a workflow-based data analytics system designed to let non-technical users perform data analytics tasks with ease through seamless collaboration and responsive interaction. Texera enables multiple users to construct workflows collaboratively, offering an experience similar to that of Google Docs and Overleaf. Texera also allows users to interact with a workflow during its execution: they can pause and resume the workflow, inspect execution states, and modify its logic as needed.
In this thesis, we first present an overview of the Texera system in Chapter 2, discussing the design choices and associated tradeoffs of the key components that enable real-time collaboration and user interaction. In Chapter 3, we explore a specific case of user interaction: modifying the logic of operators in a running workflow, which we refer to as reconfiguration. We develop an algorithm called Fries, which schedules reconfigurations with minimal delay while maintaining transactional guarantees, particularly when a reconfiguration involves multiple operators. In Chapter 4, we shift our focus to incremental data processing, since Texera uses progressive computation to deliver early results to users. We present Tempura, a cost-based optimization framework for incremental processing. As a general framework, Tempura supports the incremental computation requirements of a wide range of applications and use cases, including ones beyond Texera's scope, and selects the best incremental computation plan based on the specific query and data involved. In Chapter 5, we conclude the thesis and discuss future work.