Supporting Query-Driven Cleaning in Probabilistic Databases
Organizations collect substantial amounts of user data from multiple sources in order to explore it analytically and derive meaningful insights. One obstacle that prevents organizations from reaping the benefits of such analysis is the low quality of the collected data. Hence, most data preparation time is dedicated to cleaning the data, from fixing type errors to removing uncertainty or ambiguity. A newer paradigm for handling such issues integrates the cleaning process into the query execution workflow, so that only the tuples needed by the query are cleaned, rather than cleaning the entire dataset before query execution. In this thesis, we tackle the challenge of applying the query-driven cleaning approach to probabilistic queries. First, we present TQEL, a framework that integrates the entity linking (EL) task with query processing to answer top-k entity queries over a collection of social media blogs. Entity linking removes the ambiguity of certain words in a textual snippet by linking those words to real-world entities. The TQEL framework offers two variants, TQEL-exact and TQEL-approximate, which retrieve the exact and approximate top-k results, respectively. TQEL-approximate uses a weaker stopping condition and achieves significantly better performance, at a fraction of the cost of TQEL-exact, while providing strong probabilistic guarantees (over two orders of magnitude fewer EL calls than TQEL-exact at a 95% confidence threshold). Subsequently, we propose TQELX, a framework that generalizes the previous approach to support multiple aggregation functions and other group-based aggregation queries. TQELX is an analysis-aware cleaning approach for probabilistic queries that use approximate confidence computation. It tightly incorporates the cleaning step into multiple stages of the Monte-Carlo simulation so that results are returned as quickly as possible.
We compare our approach against multiple probabilistic query answering baselines and show that TQELX outperforms them in total execution time. Lastly, we discuss the incremental view maintenance problem in probabilistic databases and provide a solution that speeds up execution in the presence of database updates. Naively, if a tuple is inserted, deleted, or updated, the previously computed results become obsolete and the query must be re-executed. When the query uses approximate confidence computation techniques, such re-execution incurs significant overhead and unacceptably delays the overall execution. We implement PIVM, a solution built on top of PostgreSQL that supports delta computation techniques efficiently. Our experiments on multiple select-project-join queries demonstrate the effectiveness of this approach and show that PIVM offers a substantial execution speed-up in the case of updates.
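
To make the delta computation idea concrete, the following is a minimal Python sketch and not PIVM's actual implementation, which is built on PostgreSQL. It assumes two tuple-independent probabilistic relations R and S joined on a shared key, so that the probability of a joined row is simply the product of its two input probabilities. The names JoinView, insert_r, insert_s, delete_r, and recompute are hypothetical and chosen only for illustration. Projection, which would require lineage tracking and confidence computation, is deliberately left out.

```python
from collections import defaultdict


class JoinView:
    """Materialized view of R JOIN S on a shared key, maintained from deltas.

    Assumption: tuples are independent, so a joined row's probability is
    the product of the probabilities of the two tuples it is built from.
    """

    def __init__(self):
        self.r = defaultdict(dict)  # join key -> {r_id: probability}
        self.s = defaultdict(dict)  # join key -> {s_id: probability}
        self.rows = {}              # (r_id, s_id) -> probability of joined row

    def insert_r(self, r_id, key, prob):
        self.r[key][r_id] = prob
        # Delta rule: join only the new R tuple against the matching S tuples.
        for s_id, s_prob in self.s[key].items():
            self.rows[(r_id, s_id)] = prob * s_prob

    def insert_s(self, s_id, key, prob):
        self.s[key][s_id] = prob
        # Symmetric delta rule for an insertion into S.
        for r_id, r_prob in self.r[key].items():
            self.rows[(r_id, s_id)] = r_prob * prob

    def delete_r(self, r_id, key):
        self.r[key].pop(r_id, None)
        # Remove only the view rows that were derived from the deleted tuple.
        for s_id in self.s[key]:
            self.rows.pop((r_id, s_id), None)


def recompute(view):
    """Naive baseline: rebuild the whole join result from the base relations."""
    out = {}
    for key, r_tuples in view.r.items():
        for r_id, r_prob in r_tuples.items():
            for s_id, s_prob in view.s.get(key, {}).items():
                out[(r_id, s_id)] = r_prob * s_prob
    return out
```

After any sequence of insertions and deletions, view.rows should match recompute(view), while each update touches only the tuples that share the affected join key instead of re-running the full join.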