In the era of big data, today’s enterprises have access not only to large local repositories and data warehouses but also to a very large number of diverse data sources, including web data repositories, continuously generated sensor data, social media posts, clickstream data from web portals, captured audio/video data, and so on. As a result, modern applications increasingly demand up-to-the-minute analysis on top of these dynamic and/or heterogeneous data sources. These requirements pose challenging new problems for traditional entity resolution techniques and for data cleaning techniques in general. In this thesis, we respond to some of these challenges by developing an analysis-aware approach to entity resolution.
First, we explore the problem of analysis-aware data cleaning in the context of selection queries. Specifically, we propose an “on-the-fly” data cleaning framework for SQL-like selection queries. The objective of this framework is to perform only the minimal number of cleaning steps required to answer a user query correctly. Our approach leverages the concept of vestigiality to reduce cleaning overhead. A comprehensive empirical evaluation demonstrates that the proposed solution is significantly more efficient than traditional techniques in the given problem settings.
Subsequently, we study analysis-aware data cleaning for the more general case in which queries can be complex SQL-style selections and joins. In particular, we develop a framework for integrating entity resolution techniques with query processing. The aim of this framework is to use the query semantics to reap the benefits of early predicate evaluation while minimizing redundant data cleaning work. This framework relies on the notion of polymorphic operators, which are analogous to the common relational algebra operators with one exception: they know how to test the query predicates on dirty data before it is cleaned. We conducted extensive experiments on real and synthetic datasets to evaluate the effectiveness of our approach.
Overall, our experiments show that our analysis-aware approaches significantly outperform traditional entity resolution techniques, especially when the query is very selective.