Over the last two decades, relational databases have been specialized to meet the needs of analytical workloads. For instance, analytical queries often touch only a few attributes (or columns) of each tuple rather than the whole tuple; hence, many analytical databases store data in a columnar layout to reduce I/O cost. Additionally, analytical relational databases have shifted from interpreter-based query processing to compiling queries into executable programs that can match the performance of hand-written specialized code.
Document store database systems have gained traction for storing and querying large volumes of semi-structured data without requiring users to pre-define a schema. This flexibility allows users to change the structure of incoming records without taking the system offline or hindering the performance of currently running queries. Despite their popularity, document stores have lacked the analytics-oriented advances made in relational databases. In fact, the redundant structural information (``embedded schemas'') in the records stored in document stores can introduce an unnecessary storage overhead that can render document stores even less performant than non-analytics-specialized relational databases.
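To make the ``embedded schema'' overhead concrete, the following minimal sketch (the records and field names are illustrative, not from any real dataset) compares the size of self-describing JSON records against the size of their values alone, as if the schema were stored once:

```python
import json

# Three self-describing records: each one repeats the same field
# names ("embedded schema") alongside its actual values.
records = [
    {"id": 1, "name": "Ann", "city": "Irvine"},
    {"id": 2, "name": "Bob", "city": "Tustin"},
    {"id": 3, "name": "Eve", "city": "Orange"},
]

# Bytes spent on the full self-describing encoding.
raw_size = sum(len(json.dumps(r).encode()) for r in records)

# Bytes spent on the values alone, as if the field names were
# factored out into a schema stored only once.
value_size = sum(len(json.dumps(list(r.values())).encode()) for r in records)

overhead = raw_size - value_size
print(raw_size, value_size, overhead)
```

Even in this tiny example the repeated field names account for a large fraction of the stored bytes, and the fraction grows with the number of records.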
Despite the performance gap between relational and document databases, many users have no choice but to use the slower yet flexible document stores. In this dissertation, we aim to optimize document stores and show that users can enjoy performance similar to that of analytics-specialized relational databases without sacrificing the flexibility of the document model. Specifically, this dissertation makes the following main contributions:
We first address the lack of schemas and the storage overhead in document stores by proposing the tuple compactor, a framework for inferring and extracting the schemas of self-describing semi-structured records. Our tuple compactor exploits the events of Log-structured Merge (LSM) tree storage, namely the flush operation, to infer and extract the schema during data ingestion. We experimentally demonstrate the efficiency of our tuple compactor on large real-world datasets.
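The core of schema inference can be sketched as merging the structure of each incoming record into a running schema. The sketch below is a simplified illustration under assumed conventions (the `infer_type`/`merge` names and the `union` marker are ours, not the tuple compactor's actual encoding):

```python
# Infer a structural "type" for one self-describing record.
def infer_type(value):
    if isinstance(value, dict):
        return {k: infer_type(v) for k, v in value.items()}
    return type(value).__name__

# Merge a record's inferred type into the running schema.
def merge(schema, other):
    # Nested objects merge field by field; conflicting scalar types
    # collapse into a union of the observed types.
    if isinstance(schema, dict) and isinstance(other, dict):
        out = dict(schema)
        for k, v in other.items():
            out[k] = merge(out[k], v) if k in out else v
        return out
    if schema == other:
        return schema
    return {"union": sorted({str(schema), str(other)})}

records = [
    {"id": 1, "name": "Ann"},
    {"id": "a2", "name": "Bob", "tags": {"vip": True}},
]

schema = infer_type(records[0])
for r in records[1:]:
    schema = merge(schema, infer_type(r))
print(schema)
```

After both records are ingested, `id` becomes a union of `int` and `str`, while `tags` appears as an optional nested object, mirroring how a document store must tolerate heterogeneous records.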
Second, we use the inferred schema to introduce two layouts for storing nested semi-structured data in columnar formats. We also propose several extensions to the Dremel format, a popular columnar format for nested data, to accommodate document stores' flexible data model. Our experiments show significant performance gains, improving query execution time by orders of magnitude while minimally impacting ingestion~performance.
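The idea behind Dremel-style column striping can be illustrated with a minimal sketch: each leaf value is stored in a column named by its path, tagged with a definition level recording how deep the (optional) path is present. This is a deliberate simplification; real Dremel also tracks repetition levels and encodes missing fields explicitly, both omitted here:

```python
# Stripe nested records into per-path columns of (value, definition
# level) pairs. Missing optional paths simply contribute nothing in
# this simplified sketch.
def stripe(record, columns, path=(), level=0):
    for key, value in record.items():
        p = path + (key,)
        if isinstance(value, dict):
            stripe(value, columns, p, level + 1)
        else:
            columns.setdefault(".".join(p), []).append((value, level + 1))

columns = {}
for rec in [{"id": 1, "geo": {"city": "Irvine"}},
            {"id": 2}]:  # second record has no "geo" object
    stripe(rec, columns)
print(columns)
```

A query touching only `id` can now scan that one column and skip `geo.city` entirely, which is the source of the columnar I/O savings discussed above.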
Third, we present our code generation framework, which can handle the document data model's polymorphic nature and improve query execution performance in Apache AsterixDB. We analytically and empirically show that the CPU overhead of an interpreter-style execution engine can be a bottleneck even in disk-based databases like AsterixDB. Our code generation framework remedies this CPU overhead and significantly improves the performance of analytical queries. We also propose several improvements to AsterixDB's aggregation framework to optimize group-by queries, a crucial class of queries in analytical workloads.
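The contrast between interpretation and code generation can be sketched for a predicate like `a + b > 10`. The names and expression encoding below are illustrative only (AsterixDB's actual framework operates on its own query plans, not Python tuples):

```python
# Interpreter: re-walks the expression tree for every tuple,
# paying dispatch cost at each node per row.
def interpret(expr, row):
    op = expr[0]
    if op == "col":
        return row[expr[1]]
    if op == "const":
        return expr[1]
    left, right = interpret(expr[1], row), interpret(expr[2], row)
    return left + right if op == "+" else left > right

# "Code generator": collapses the same tree into one closure, so
# the tree-walking cost is paid once at compile time, not per tuple.
def compile_expr(expr):
    op = expr[0]
    if op == "col":
        i = expr[1]
        return lambda row: row[i]
    if op == "const":
        c = expr[1]
        return lambda row: c
    f, g = compile_expr(expr[1]), compile_expr(expr[2])
    if op == "+":
        return lambda row: f(row) + g(row)
    return lambda row: f(row) > g(row)

# Predicate: col0 + col1 > 10
expr = (">", ("+", ("col", 0), ("col", 1)), ("const", 10))
pred = compile_expr(expr)
rows = [(4, 5), (6, 7)]
print([interpret(expr, r) for r in rows], [pred(r) for r in rows])
```

Both paths produce identical answers; the generated closure merely eliminates the per-tuple tree traversal, which is the overhead an interpreter-style engine pays on every row.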