Skip to main content
eScholarship
Open Access Publications from the University of California

Approximation and Search Optimization on Massive Data Bases and Data Streams

  • Author(s): ZENG, KAI
  • Advisor(s): Zaniolo, Carlo
  • et al.
Abstract

A fast response is critical in many data-intensive applications,

including knowledge discovery analytics on big data,

and queries searching for complex patterns in sequences, data streams and graphs.

Moreover, the volume of data and the complexity of the analytical tasks

they must support are now growing at such a torrid rate that

the vigorous progress in performance and scalability of computer systems

cannot keep up with it.

This situation calls for

(i) effective optimization techniques to reduce the cost of complex pattern queries, and

(ii) approximation techniques to produce results of predictable accuracy

using a small subset of the data.

In this dissertation we

(i) introduce new query languages and optimization techniques

for pattern matching in sequences, data streams and graphs, and

(ii) formulate a general approximation model for analytics queries.

Thus, in this dissertation we have made the following contributions:

(i) We have designed and demonstrated optimized implementation techniques

for K*SQL and XSeq, which provide a unified framework for complex pattern searching

on relational and XML DBs, respectively.

In particular, we introduced efficient execution exploiting recent advances

in automata theory known as Nested Words.

(ii) We have designed and demonstrated efficient scalable graph search engine

based on novel distributed memory-based system architecture,

and exploit graph exploration operations for implementing

an efficient graph search algorithm.

(iii) We have introduced support for bootstrap methods in MapReduce.

Bootstrap is a very useful estimation technique for sampling-based approximation.

Thus we designed the EARL of Hadoop system, that facilitates and optimizes

the use bootstrap methods on parallel MapReduce systems.

(iv) We have then invented and demonstrated an analytical model for bootstrap,

whereby the Monte-Carlo evaluation of the standard method

is replaced by a probabilistic query.

Thus, we provided a semiring-based extension of relational algebra and

related query optimization techniques to support fast execution

of the resulting probabilistic query.

We finally developed an Analytical Bootstrap System (ABS) for parallel

and distributed computing platforms.

ABS is applicable to most relational database queries and

delivers very accurate estimates at speeds that

outperforms the traditional bootstrap method by orders of magnitude.

Main Content
Current View