Revisiting Aggregation Techniques for Data Intensive Applications
Aggregation has been an important operation since the early days of relational databases. Today's Big Data applications bring further challenges when processing aggregation queries, demanding robust aggregation algorithms that can process large volumes of data efficiently in a distributed, share-nothing architecture. Moreover, aggregation on each node runs under a potentially limited memory budget (especially in multiuser settings). Despite its importance, the design and evaluation of aggregation algorithms has not received the same attention that other basic operators, such as joins, have received in the literature.
This dissertation revisits the engineering of efficient aggregation algorithms for use in Big Data platforms, considering both local and global aggregations. We firstly discuss the salient implementation details and precise cost models of several candidate local aggregation algorithms and present an in-depth experimental performance study to guide future Big Data engine developers. We show that the efficient implementation of the local aggregation operator for a Big Data platform is non-trivial and that many factors, including memory usage, spilling strategy, and I/O and CPU cost, should be considered. Then we show extended cost models that can precisely depict the cost of global aggregation plans, considering not only the local cost factors but also the network cost. We discuss a generic framework to describe a tree-structured global aggregation plan, and propose a cost-model based algorithm for efficiently finding the non-dominated global aggregation plans for different output properties, given the input data statistics and the global computation resources.
Furthermore, spatial and temporal information introduces more semantics to traditional aggregations, requiring specific efficient algorithms that could utilize the additional spatial and temporal information during the query processing. In the last part of the thesis we show a novel aggregation application for monitoring the top-k unsafe moving objects in a continuous data stream where the spatial and temporal information change. We show that such a query can be generalized as an extended aggregation operation where the grouping condition is unknown without looking at all valid data in the given query time window. We then propose I/O-efficient algorithms to answer such queries utilizing spatial and temporal index structures.