Interval Joins for Big Data
- Author(s): Carman, Jr., Eldon Preston
- Advisor(s): Tsotras, Vassilis J
- et al.
The main part of this dissertation considers how to scale interval join queries. To provide scalable query processing for such joins, we adapted five recently published overlapping interval join algorithms and modified them to work in a shared-nothing big data management system (AsterixDB) under a memory budget. We developed a cost model for each algorithm to predict when an algorithm will spill to disk (run out of memory). Our experimental evaluation shows that the cost models are accurate and can be used to pick the most efficient algorithm for the given input data. The adapted interval join algorithms are shown to scale for large datasets using both synthetic and real datasets. Finally, we further adapt these algorithms to support several new types of interval joins, specifically overlap and contains, as defined by Allen's interval algebra. We detail how to abstract the memory management from these algorithms.
As a by-product we also implemented a scalable parallel processor, namely Apache VXQuery, that extends a stack consisting of Hyracks, a parallel execution engine, and Algebricks, a language-agnostic compiler toolbox. VXQuery provides an implementation of the XQuery specifics (data model, data-model dependent functions and optimizations, and a parser). We describe the architecture of Apache VXQuery, its integration with Hyracks and Algebricks, and the XQuery optimization rules applied to the query plan to improve path expression efficiency and to enable query parallelism. An experimental evaluation using a real 500GB dataset with various XML selection, aggregation, and join queries shows that Apache VXQuery performs well both in terms of scale-up and speed-up.