Approximation and Search Optimization on Massive Data Bases and Data Streams
- Author(s): ZENG, KAI
- Advisor(s): Zaniolo, Carlo
- et al.
A fast response is critical in many data-intensive applications,
including knowledge discovery analytics on big data,
and queries searching for complex patterns in sequences, data streams and graphs.
Moreover, the volume of data and the complexity of the analytical tasks
they must support are now growing at such a torrid rate that
the vigorous progress in performance and scalability of computer systems
cannot keep up with it.
This situation calls for
(i) effective optimization techniques to reduce the cost of complex pattern queries, and
(ii) approximation techniques to produce results of predictable accuracy
using a small subset of the data.
In this dissertation we
(i) introduce new query languages and optimization techniques
for pattern matching in sequences, data streams and graphs, and
(ii) formulate a general approximation model for analytics queries.
Thus, in this dissertation we have made the following contributions:
(i) We have designed and demonstrated optimized implementation techniques
for K*SQL and XSeq, which provide a unified framework for complex pattern searching
on relational and XML DBs, respectively.
In particular, we introduced efficient execution exploiting recent advances
in automata theory known as Nested Words.
(ii) We have designed and demonstrated efficient scalable graph search engine
based on novel distributed memory-based system architecture,
and exploit graph exploration operations for implementing
an efficient graph search algorithm.
(iii) We have introduced support for bootstrap methods in MapReduce.
Bootstrap is a very useful estimation technique for sampling-based approximation.
Thus we designed the EARL of Hadoop system, that facilitates and optimizes
the use bootstrap methods on parallel MapReduce systems.
(iv) We have then invented and demonstrated an analytical model for bootstrap,
whereby the Monte-Carlo evaluation of the standard method
is replaced by a probabilistic query.
Thus, we provided a semiring-based extension of relational algebra and
related query optimization techniques to support fast execution
of the resulting probabilistic query.
We finally developed an Analytical Bootstrap System (ABS) for parallel
and distributed computing platforms.
ABS is applicable to most relational database queries and
delivers very accurate estimates at speeds that
outperforms the traditional bootstrap method by orders of magnitude.