Lin, Yiming

Optimizing Query Processing for Data-Intensive Computation

2023

Advisor(s): Mehrotra, Sharad

Abstract

Today, data-driven analysis and applications exploit vast streams of data that are perpetually generated and collected from numerous data sources. This surge in data production, which has reached over 10,000 Exabytes, is driving transformative advancements in sectors such as transportation, emergency services, and health and wellness. Before data is queried and used by downstream analytical tasks, various computationally intensive operations often need to be performed. These operations include data cleaning, data integration, and data enrichment, which often execute expensive AI/ML models and incur non-trivial costs. They often cannot be performed at data ingestion time because of the rate at which data is produced, and periodic offline computation is infeasible because of the volume of data. Query processing in such a situation requires careful incorporation and co-optimization of computationally expensive operations within the query engine, so as to streamline query analysis and improve execution efficiency in terms of time and resources.

The goal of this thesis is to develop mechanisms to support computationally expensive operations (e.g., enrichment, imputation, information extraction, data interpretation) within data management systems in order to support interactive analysis. While the techniques developed in the thesis have wide applicability, our focus is on emerging smart space applications. Smart spaces consist of sensor-embedded physical spaces that capture and represent the dynamic state of the physical infrastructure and of the people interacting with that infrastructure and with each other.

Data management in smart spaces opens several new challenges, one of which is the ability to support interactive analytics on very large volumes of data captured at high velocity. The problems studied in this thesis draw their motivation from such challenges. In Chapter 3, we develop a query-time missing value imputation framework, entitled LaZy Imputation during query Processing (ZIP), which modifies relational operators to be imputation-aware in order to minimize the joint cost of imputation and query processing. The modified operators use a cost-based decision function to determine whether to invoke imputation immediately or to defer it, leaving the missing values to be resolved by downstream operators. The modified query processing logic ensures that results with deferred imputations are identical to those produced if all missing values were imputed first. ZIP includes a novel outer-join-based approach to preserve missing values during execution, and a Bloom filter-based index to reduce space and running-time overheads. Extensive experiments on both real and synthetic data sets demonstrate orders-of-magnitude improvements in query processing when ZIP-based deferred imputations are used.
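The deferral idea can be illustrated with a minimal Python sketch. It is not ZIP's implementation: it does not model the cost-based decision function, the outer-join mechanism, or the Bloom filter index. The row layout, the function names, and the toy imputation rule are all assumptions made for illustration. The sketch only shows that applying a cheap predicate on a complete column before imputing a missing column returns the same answer as imputing everything first, with fewer imputation calls.

```python
from typing import Callable, Dict, List, Optional

# Hypothetical row layout: column name -> value, with None marking a missing value.
Row = Dict[str, Optional[int]]


def make_counting_imputer():
    """Return a toy imputer and a call counter.

    The imputer stands in for an expensive model; the rule b = 2 * a is an
    assumption chosen only so the example is deterministic.
    """
    calls = {"n": 0}

    def impute(row: Row) -> int:
        calls["n"] += 1
        return row["a"] * 2

    return impute, calls


def eager_query(rows: List[Row], impute: Callable[[Row], int]) -> List[Row]:
    """Impute every missing b first, then evaluate a > 5 AND b < 20."""
    filled = [dict(r, b=impute(r) if r["b"] is None else r["b"]) for r in rows]
    return [r for r in filled if r["a"] > 5 and r["b"] < 20]


def lazy_query(rows: List[Row], impute: Callable[[Row], int]) -> List[Row]:
    """Evaluate the predicate on the complete column a first, and impute b
    only for rows that survive it."""
    out = []
    for r in rows:
        if not r["a"] > 5:
            continue  # dropped without ever imputing b
        b = r["b"] if r["b"] is not None else impute(r)
        if b < 20:
            out.append(dict(r, b=b))
    return out
```

On a table where most rows fail the cheap predicate, the lazy variant calls the imputer only for the rows that can still appear in the result, which is the saving the deferred approach targets.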

In Chapter 4, we present a system for automated Predicate LeArning at QUery timE (PLAQUE), which automatically infers new predicates while running queries in order to accelerate query execution. PLAQUE represents a significant departure from prior work on learning predicates, which is either limited to queries containing selection conditions on certain columns (e.g., columns involved in a join in the query) or requires statistics, such as range sets, to be collected and maintained from the data. We identify several opportunities to learn predicates at query time from various query conditions, such as aggregation, equi-join, theta-join, and group by/having conditions. In PLAQUE, learned predicates are pushed down to the optimal positions in a given query plan tree in order to maximize their benefit, using a novel partial-order-based approach developed for this purpose. Furthermore, we introduce a pre-learning technique for predicate inference before query optimization, which combines with the runtime learning approach of PLAQUE to further enhance performance. Comprehensive evaluations on both synthetic and real datasets demonstrate that our learned predicates accelerate query execution by an order of magnitude, and that the improvements are even higher (two orders of magnitude) when computationally expensive operators (imputation/enrichment) in the form of User-Defined Functions (UDFs) are used in queries. PLAQUE thus significantly benefits data-driven analytical applications.

In Chapter 5, we shift our focus to applying the techniques developed in the thesis to data processing and query analytics in smart spaces. We first develop LOCATER, an indoor localization solution based on WiFi connectivity data. LOCATER is zero-cost, accurate (90% accuracy), and passive, requiring no new hardware in the building and no new software on users' phones. LOCATER has already been deployed and is operational in the USA and India, across three distinct locations (UCI, BSU, Plaksha), and has run in over 40 UCI buildings for four years. LOCATER serves as a representative compute-intensive task in building smart space applications. We conduct a case study by building two applications on top of LOCATER: occupancy and contact tracing. Our case study clearly demonstrates the benefits of using our query processing techniques, ZIP and PLAQUE, in building campus-scale smart space applications. For instance, we show that queries that were impractical to execute interactively without these optimizations can now be used for interactive analytics.
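The query-time predicate learning idea summarized for Chapter 4 can be illustrated with a small Python sketch. This is not PLAQUE's algorithm: it swaps in a simple min/max range filter derived from one side of an equi-join, in the spirit of runtime join filters, and it ignores the partial-order-based placement and pre-learning steps. All function names and the row layout are hypothetical. The point it shows is that a predicate inferred while the query runs can prune rows before an expensive UDF is invoked on them.

```python
from typing import Callable, Dict, List

# Hypothetical row layout: column name -> value.
Row = Dict[str, object]


def learn_range_predicate(build_rows: List[Row], key: str) -> Callable[[object], bool]:
    """Infer a predicate lo <= key <= hi from the join keys seen on the build side.

    Any probe row whose key falls outside this range cannot find a join partner.
    """
    keys = [r[key] for r in build_rows]
    if not keys:
        return lambda v: False  # empty build side: no probe row can match
    lo, hi = min(keys), max(keys)
    return lambda v: lo <= v <= hi


def join_with_learned_predicate(
    probe_rows: List[Row],
    build_rows: List[Row],
    key: str,
    expensive_udf: Callable[[Row], object],
) -> List[Row]:
    """Hash equi-join that applies the learned predicate before calling the UDF."""
    pred = learn_range_predicate(build_rows, key)
    index: Dict[object, List[Row]] = {}
    for b in build_rows:
        index.setdefault(b[key], []).append(b)

    out = []
    for p in probe_rows:
        if not pred(p[key]):
            continue  # pruned by the learned predicate; the UDF is never called
        enriched = dict(p, label=expensive_udf(p))
        for b in index.get(p[key], []):
            # Prefix build-side columns with "s_" to avoid name clashes.
            out.append({**enriched, **{f"s_{k}": v for k, v in b.items()}})
    return out
```

A range filter is deliberately coarse: it may let through rows that still fail the join, but it never removes a row that would have matched, so the query result is unchanged.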


UC Irvine
