Queries with Bounded Errors & Bounded Response Times on Very Large Data
- Author(s): Agarwal, Sameer
- Advisor(s): Stoica, Ion
- et al.
Modern data analytics applications typically process massive amounts of data on clusters of tens, hundreds, or thousands of machines to support near-real-time decisions. The quantity of data and limitations of disk and memory bandwidth often make it infeasible to deliver answers at human-interactive speeds. However, it has been widely observed that many applications can tolerate some degree of inaccuracy. This is especially true for exploratory queries on data, where users are satisfied with "close-enough" answers if they can be provided quickly to the end user. A popular technique for speeding up queries at the cost of accuracy is to execute each query on a sample of data, rather than the whole dataset. In this thesis, we present BlinkDB, a massively parallel, approximate query engine for running interactive SQL queries on large volumes of data. BlinkDB allows users to trade-off query accuracy for response time, enabling interactive queries over massive data by running queries on data samples and presenting results annotated with meaningful error bars. To achieve this, BlinkDB uses three key ideas: (1) an adaptive optimization framework that builds and maintains a set of multi-dimensional stratified samples from original data over time, (2) a dynamic sample selection strategy that selects an appropriately sized sample based on a query's accuracy or response time requirements, and (3) an error estimation and diagnostics module that produces approximate answers and reliable error bars. We evaluate BlinkDB extensively against well-known database benchmarks and a number of real-world analytic workloads showing that it is possible to implement an end-to-end query approximation pipeline that produces approximate answers with reliable error bars at interactive speeds.