Sevim, Akil

Efficient Query Processing Techniques for Data Exploration in Heterogeneous and Distributed Systems

2023

Advisor(s): Eldawy, Ahmed

Creative Commons 'BY' version 4.0 license

Abstract

The growing variety and volume of big data has sparked interest in data-driven applications. As a result, there is increasing demand for effective ways to explore, transform, and understand data across various platforms, including distributed systems. Extracting insights from data involves cleaning, transforming, visualizing, and combining it, which requires scalable systems for quick knowledge extraction. This thesis introduces new techniques for processing queries in big data management systems.

Data exploration is a trial-and-error process in which queries often return empty results. To address this, the thesis introduces HQ-Filter, an agile hierarchy-aware data structure. HQ-Filter exploits the hierarchy in the data to build a configurable probabilistic filter that efficiently eliminates empty-result queries on the client side. Applied to the UCR-Star and Cloudberry systems for spatiotemporal-textual data, HQ-Filter increases server capacity by up to 66%, speeds up response times by up to 15x, and reduces server workload by up to 90%.
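To make the idea of a client-side, hierarchy-aware probabilistic filter concrete, here is a minimal sketch. It is not the HQ-Filter design from the thesis: it assumes that each record carries a hierarchical key (for example year, month, day) and uses an ordinary Bloom filter over every prefix of that key. The class name `PrefixBloomFilter`, its parameters, and the key layout are all hypothetical and chosen only for illustration.

```python
import hashlib


class PrefixBloomFilter:
    """Illustrative Bloom filter over hierarchical key prefixes (an assumption,
    not the HQ-Filter structure described in the thesis)."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        # One byte per bit keeps the sketch simple; a real filter would pack bits.
        self.bits = bytearray(num_bits)

    def _positions(self, key):
        # Derive num_hashes bit positions from a tuple of string components.
        encoded = "/".join(key).encode("utf-8")
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(bytes([seed]) + encoded).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, path):
        # Register the record under every level of its hierarchy,
        # e.g. ("2023",), ("2023", "05"), ("2023", "05", "14").
        for depth in range(1, len(path) + 1):
            for pos in self._positions(path[:depth]):
                self.bits[pos] = 1

    def might_contain(self, prefix):
        # False means the query result is definitely empty;
        # True means it may be non-empty (false positives are possible).
        return all(self.bits[pos] for pos in self._positions(prefix))
```

A client holding such a filter would call `might_contain` before sending a query and skip the server round trip whenever it returns `False`. The filter never reports a stored prefix as absent, so skipped queries are guaranteed to be empty, while the false-positive rate depends on `num_bits` and `num_hashes`.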

Moreover, today's data scientists often need to combine diverse big datasets in distributed systems using join queries with complex conditions. However, few methods are available in Database Management Systems (DBMSs) that can generate an optimized query plan for such queries, because implementing and integrating them is complex. To overcome this issue, we introduce the Flexible User-defined Distributed Joins (FUDJ) framework, which seeks to make optimized join algorithms more widely available within DBMSs.

FUDJ lets users implement partition-based distributed join algorithms without deep knowledge of DBMS internals or distributed programming. Through a novel extensibility architecture, FUDJ increases the availability and diversity of optimized join algorithms, giving data scientists and database researchers more options.
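As a rough illustration of what a partition-based join with user-supplied logic can look like, the following single-process sketch separates the generic partition-and-verify loop from two user-defined functions. It is not the FUDJ API: the names `partitioned_join`, `interval_buckets`, and `overlaps`, and the interval-overlap example itself, are assumptions made for this sketch.

```python
from collections import defaultdict


def partitioned_join(left, right, assign, match):
    """Generic partition-based join (illustrative, single process).

    assign(record) -> iterable of bucket ids the record belongs to
    match(a, b)    -> True if the pair satisfies the join condition
    """
    buckets = defaultdict(lambda: ([], []))
    for i, rec in enumerate(left):
        for b in assign(rec):
            buckets[b][0].append(i)
    for j, rec in enumerate(right):
        for b in assign(rec):
            buckets[b][1].append(j)

    seen = set()  # a pair may share several buckets; report it only once
    results = []
    for b in sorted(buckets):
        left_ids, right_ids = buckets[b]
        # In a distributed setting, each bucket could be handled by a different worker.
        for i in left_ids:
            for j in right_ids:
                if (i, j) not in seen and match(left[i], right[j]):
                    seen.add((i, j))
                    results.append((left[i], right[j]))
    return results


def interval_buckets(interval, width=10):
    # Hypothetical user-defined partitioner: every fixed-width bucket the interval touches.
    start, end = interval
    return range(start // width, end // width + 1)


def overlaps(a, b):
    # Hypothetical user-defined join condition: closed intervals overlap.
    return a[0] <= b[1] and b[0] <= a[1]
```

For example, `partitioned_join([(0, 5), (8, 22)], [(4, 9), (30, 35)], interval_buckets, overlaps)` compares only records that share a bucket, so `(30, 35)` is never tested against either left interval. In this sketch the user writes just the partitioner and the predicate, which mirrors the division of labor the paragraph above describes.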

FUDJ facilitates query processing and is designed to be embedded in any query optimizer. Using "CREATE JOIN," FUDJ deploys join libraries, detects flexible distributed join queries, constructs optimized plans, and offers execution options. Implemented in Apache AsterixDB, FUDJ delivers substantial efficiency gains (20x less work) and speedups (up to 1200x) compared to built-in and on-top approaches.


UC Riverside
