Data-driven technologies such as decision support, analysis, and scientific discovery tools have become a critical component of many organizations and businesses. The effectiveness of such technologies, however, is closely tied to the quality of the data to which they are applied. This is why organizations today spend a substantial portion of their budgets on cleaning tasks such as removing duplicates, correcting errors, and filling in missing values, so as to improve data quality before pushing data through the analysis pipeline.
Entity resolution (ER), the process of identifying which entities in a dataset refer to the same real-world object, is a well-known data cleaning challenge. Traditionally, however, ER is performed as an offline step before the data is made available for analysis. Such an offline strategy is unsuitable for many emerging analytical applications that require low-latency responses (and thus cannot tolerate the delays caused by cleaning the entire dataset), as well as for situations where the underlying resources are constrained or costly to use. To overcome these limitations, this thesis studies a new paradigm for ER: progressive entity resolution. Progressive ER aims to resolve the dataset in an order that maximizes the rate at which data quality improves. This approach can substantially reduce the resolution cost, since the ER process can be terminated early once a satisfactory level of quality is achieved.
In this thesis, we explore two aspects of the ER problem and propose a progressive approach to each of them. In particular, we first propose a progressive approach to relational ER, wherein the input dataset consists of multiple entity-sets and the relationships among them. We then propose a progressive approach to parallel ER using the popular MapReduce (MR) framework. A comprehensive empirical evaluation of the two proposed approaches demonstrates that they achieve high-quality results at a limited resolution cost.