Skip to main content
Open Access Publications from the University of California

Accelerating Analytic Queries on Compressed Data

  • Author(s): Lin, Chunbin
  • Advisor(s): Papakonstantinou, Yannis
  • et al.

Data compression techniques (both lossless and lossy compression methods) are widely utilized in big data analytic applications in domains including health-care, transportation, and finance. The main benefit achieved from applying data compression techniques is the saving of space cost. However, performing analytic queries on compressed data has two major challenges in terms of the performance and the accuracy: (i) decompressing data may damage the performance, and (ii) if lossy data compression techniques are utilized then the returned answers are not accurate. In this dissertation, we study how to accelerate analytic queries over compressed data (and provide tight error guarantees for approximate answers if lossy data compression methods are applied).

First, this dissertation introduces an in-memory database, called GQ-Fast, which supports high-performance analytic SQL queries over compressed relational data. We study a class of graph analytic SQL queries, called relationship queries. These queries involving aggregation, join, semijoin, intersection and selection are a wide superset of fixed-length graph reachability queries and of tree pattern queries. We present real-world OLAP scenarios, where efficient relationship queries are needed. However, row stores, column stores and graph databases are unacceptably slow in such OLAP scenarios. To support efficient relationship queries, we propose a GQ-Fast database, which is an indexed database that roughly corresponds to efficient encoding of annotated adjacency lists that combines salient features of column-based organization, indexing and compression. GQ-Fast uses a bottom-up fully pipelined query execution model, which enables (i) aggressive compression (e.g., compressed bitmaps and Huffman) and (ii) avoids intermediate results that consist of row IDs (which are typical in column databases). In addition, GQ-Fast compiles query plans into executable C++ source code. Besides achieving runtime efficiency, GQ-Fast also reduces main memory requirements because, unlike column databases, GQ-Fast selectively allows dense forms of compression including heavy-weight compressions, which do not support random access. GQ-Fast outperforms the state-of-the-art databases by 1-3 orders of magnitudes and GQ-Fast uses less space.

Second, this dissertation proposes an approximate query processing system, called Plato, which supports anytime analytic queries over compressed time series with sound and tight deterministic error guarantees. Plato supports expressions that are compositions of the linear algebra operators over vectors along with arithmetic operators. Such analytics can express common statistics (such as correlation and cross-correlation) that may combine multiple time series. Plato builds a compressed segment tree structure for each time series during the offline insertion time. Each node in the tree refers to a time series segment. Each segment (i) is compressed by an estimation function that approximates the actual values and is coming from a user-chosen estimation function family, and (ii) is associated with one to three (depending on the case) precomputed error measures. Then Plato is able to provide tight deterministic error guarantees satisfying the error budgets provided by users by accessing the minimal number of nodes in the segment tree. In addition, we also identify two broad estimation function family groups. The Vector Space (VS) family and the presently defined Linear Scalable Family (LSF) lead to theoretically and practically high-quality guarantees, even for queries that combine multiple time series that have been independently compressed. Well-known function families (e.g., the polynomial function family) belong to LSF. The theoretical aspect of "high quality" is crisply captured by the Amplitude Independence (AI) property: An AI guarantee does not depend on the amplitude of the involved time series, even when we combine multiple time series.

Main Content
Current View