Big data applications increasingly involve datasets that are diverse in structure: flat or nested relations, complex graphs, documents (JSON or XML), poorly structured logs, or even text data. To handle this heterogeneity, application designers usually rely on several data stores used side by side, each supporting a different data model and associated query language (or data access API), and each very efficient for some, but not all, kinds of processing on the data. Systems capable of querying disparate data in this fashion are advocated by the database community under terms such as hybrid stores or polystores.
However, these systems provide no support for semantic query optimizations, which include: (i) exploiting possible data redundancy, when the same data is accessible (with different performance) from distinct data stores; (ii) taking advantage of partial query results (in the style of materialized views) that may be available in the stores; and (iii) reasoning semantically about the properties of the various data models and query operations, which can improve the performance of hybrid workloads. Motivated by these optimization opportunities, this dissertation makes the following two main contributions:
First, we design and demonstrate ESTOCADA, an extensible, lightweight framework that provides semantic query optimization on top of hybrid stores without modifying their internals. ESTOCADA transparently enables each query to benefit from the best combination of stored data and available processing capabilities. It leverages recent advances in view-based query rewriting under constraints, which we use to describe the various data models and the stored data. An experimental evaluation on the real-world MIMIC dataset demonstrates the effectiveness of our approach and shows that ESTOCADA achieves significant performance gains.
Going beyond query workloads that cover a variety of data models (relational, JSON, graph, XML) in hybrid stores, modern applications increasingly need to blend querying and learning on the data, in tasks primarily expressed using a mix of relational algebra (RA)- and linear algebra (LA)-based languages. Existing specialized solutions for evaluating such hybrid analytical tasks either optimize RA and LA tasks separately, exploiting only RA properties while leaving LA-specific optimizations unexploited, or focus heavily on physical optimizations, leaving semantic query optimization opportunities unexplored. In our second contribution, we take a major step towards filling this gap by proposing HADAD. The novelty of HADAD is that it extends the benefits of semantic query rewriting and view-based optimization, introduced in ESTOCADA, to LA computations, which are crucial for hybrid ML analytical tasks. Our solution can be applied naturally and portably on top of pure LA and hybrid RA-LA platforms. An extensive empirical evaluation shows that HADAD yields significant performance gains on diverse workloads, ranging from LA-centered to hybrid RA-LA workloads.