Support for Scalable Analytics over Databases and Data-Streams
- Author(s): Laptev, Nikolay Pavlovich
- Advisor(s): Zaniolo, Carlo
- et al.
The world's information is doubling every two years, largely due to a tremendous growth of data from blogs, social medias and Internet searches. `Big Data Analytics' is now recognized as an emerging technology area of great opportunities and technical challenges. Parallel systems, such as those inspired by MapReduce architectures, provide a key technology to cope with those challenges---however they often cannot keep up with the fast-growing size of data and application complexity, nor can they deliver the response times required by data stream applications. In this thesis, therefore, we show that many of said limitations can be overcome by building on classical approximation techniques from statistics to estimate (i) the sample quality and (ii) the required sample size given the user-prescribed accuracy.
To achieve (i) we look into the bootstrap theory. The bootstrap approach, based on resampling, provides a simple way of assessing the quality of an estimate. The bootstrap technique, however, is computationally expensive, thus our first contribution involves making the bootstrap estimation efficient. Following our initial results, we realized that in a distributed environment the cost of transferring the data to independent processors as well as the cost of computing a single resample can be high for large samples. Furthermore the lack of a scalable support for the popular time-series data was also a problem. For these reasons, we provide an improved bootstrap approach that uses the Bag of Little Bootstraps (BLB) along with other recent advances in bootstrap and time-series theory to provide an effective Hadoop-based implementation for assessing a time-series sample quality.
To achieve (ii) we look into the data complexity and learning theory. Recently it has been shown that the performance of a classifier can be analyzed in terms of the data complexity. We start by analyzing how model complexity can be used to create a scalable pattern matching automaton. We then extend our findings to other algorithms where we explain how problem complexity affects the required sample size for a given machine-learning algorithm and accuracy requirement. We also use the learning theory to estimate the error convergence rate needed for sample size estimation. Our experimental results provide the motivation for further exploring these ideas.
A spectrum of classical data mining tasks and newly developed mining applications are used to validate the effectiveness of the proposed approaches. For example, extensive empirical results on a Twitter dataset show that the proposed techniques provide substantial improvements in processing speeds while placing the user in control of the result accuracy.