Supporting Progressive Query Processing and Scalable Data Enrichment for Real-Time Analytic Applications
In this thesis, we propose EnrichDB, a new DBMS technology designed for emerging domains (e.g., social media analytics and sensor-driven smart spaces) that require incoming data to be enriched using expensive machine learning or signal processing functions before it can be used. Today, to support online processing, such enrichment is performed outside the database as a static data processing workflow before the data is ingested. This strategy can introduce a significant delay between the time data arrives and the time it is enriched and ingested into the DBMS, especially when the enrichment complexity is high. Enrichment at ingestion can also waste resources if applications do not require all of the data to be enriched. EnrichDB’s design represents a significant departure from this approach: we explore seamless integration of data enrichment throughout the data processing pipeline, that is, at ingestion, in the background when triggered by events, and progressively during query processing. The cornerstone of EnrichDB is a powerful enrichment data and query model that encapsulates enrichment as an operator inside the DBMS, enabling it to co-optimize enrichment with query processing. The first chapter of the thesis describes this data model.
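As a rough illustration of why enriching during query processing can save work compared with enriching everything at ingestion, the following minimal Python sketch enriches a derived attribute only for tuples that survive a cheap predicate on a stored attribute. The `Tweet` type, the `classify_sentiment` stand-in, and the query function are hypothetical names chosen for this example; they are not EnrichDB's API or its operator implementation.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Tweet:
    text: str
    location: str
    # Derived attribute: stays None until an enrichment function fills it in.
    sentiment: Optional[str] = None


# Counts how often the (pretend) expensive function runs.
calls = {"n": 0}


def classify_sentiment(text: str) -> str:
    """Hypothetical stand-in for an expensive ML enrichment function."""
    calls["n"] += 1
    return "positive" if "great" in text.lower() else "negative"


def enrich(t: Tweet) -> Tweet:
    """Enrich a tuple at most once; already-enriched tuples are reused."""
    if t.sentiment is None:
        t.sentiment = classify_sentiment(t.text)
    return t


def query_positive_in(tweets: List[Tweet], location: str) -> List[Tweet]:
    """Select positive tweets from one location, enriching lazily."""
    results = []
    for t in tweets:
        # Cheap predicate on a stored attribute is checked first.
        if t.location != location:
            continue
        # Only the survivors pay the enrichment cost.
        if enrich(t).sentiment == "positive":
            results.append(t)
    return results


tweets = [
    Tweet("Great game tonight", "Irvine"),
    Tweet("Traffic is awful", "Irvine"),
    Tweet("Great coffee", "Boston"),
    Tweet("So tired", "Boston"),
]
result = query_positive_in(tweets, "Irvine")
```

In this toy setting the enrichment function runs only for the two tweets from the queried location, while the other two tweets remain unenriched until some later query or background process needs them.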
In the second chapter of the thesis, we describe two implementations of the EnrichDB system. The first takes a middleware-based approach in which the database management system is treated as a black box and enrichment is performed on a separate server, called the enrichment server. This implementation is simpler and more portable, since a user can adopt any DBMS as the storage system for the underlying data objects with only a small code change. The second is a layered implementation on top of PostgreSQL that uses the extensibility features of PostgreSQL to perform enrichment efficiently, close to where the data resides. We use user-defined functions, stored procedures, indexes, and incremental materialized views to perform data enrichment and produce query results efficiently.
In the third part of the thesis, we present the algorithms implemented in the above systems to co-optimize enrichment with query processing. We use the F1-measure as the quality metric for the results of set-based queries. Gradually improving query results by drawing online samples has been explored in Online Approximate Query Processing systems. However, such systems do not consider data enrichment as a means of improving the quality of query results, and they assume the underlying data to be static. The goal of data enrichment in EnrichDB is to enrich objects in a way that produces good-quality results as soon as possible, a goal we formalize as the progressive score. Experimental results on real-world datasets and queries show that the proposed algorithms perform significantly better than traditional sampling-based approaches.
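To make the quality metric concrete, the following minimal Python sketch computes the F1-measure of a set-based answer against a ground-truth answer set. It then combines per-epoch quality values into a single number that rewards early improvements. The `progressive_score` function here is only an assumed, illustrative formalization (a weighted sum of per-epoch quality gains with decreasing weights); the abstract does not state the thesis's own definition, and the function names, epochs, and weights are all hypothetical.

```python
from typing import List, Sequence, Set


def f1_measure(answer: Set[int], truth: Set[int]) -> float:
    """F1 of a set-based query answer: harmonic mean of precision and recall."""
    true_positives = len(answer & truth)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(answer)
    recall = true_positives / len(truth)
    return 2 * precision * recall / (precision + recall)


def progressive_score(qualities: Sequence[float], weights: Sequence[float]) -> float:
    """Assumed illustrative score: weighted sum of quality gains per epoch.

    With decreasing weights, quality gained in early epochs counts more
    than the same gain achieved later.
    """
    score = 0.0
    previous = 0.0
    for quality, weight in zip(qualities, weights):
        score += weight * (quality - previous)
        previous = quality
    return score


# Ground-truth answer set and the answers returned after each epoch.
truth = {1, 2, 3, 4}
answers_per_epoch: List[Set[int]] = [{1, 5}, {1, 2, 5}, {1, 2, 3, 4}]
weights = [1.0, 0.5, 0.25]  # hypothetical, decreasing over time

qualities = [f1_measure(a, truth) for a in answers_per_epoch]
score = progressive_score(qualities, weights)

# Two extreme strategies reaching the same final quality at different speeds.
fast = progressive_score([1.0, 1.0, 1.0], weights)
slow = progressive_score([0.0, 0.0, 1.0], weights)
```

Under this assumed scoring, a strategy that reaches full quality in the first epoch scores higher than one that reaches it only in the last epoch, which captures the intent of producing good results as early as possible.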