eScholarship
Open Access Publications from the University of California

Exploiting relationships for data cleaning

Abstract

In this paper we address the problem of data cleaning when multiple data sources are merged to create a single database. Specifically, we focus on the problem of determining whether two representations in two different sources refer to the same entity. Current research has focused on linking records from different sources by computing the similarity among them based on their attribute values. Our approach explores a new research direction by exploiting relationships among records for the purpose of cleaning. It is based on the hypothesis that if two representations refer to the same entity, there is a high likelihood that they are strongly connected to each other through multiple relationships implicit in the database. We view the database as a graph in which nodes correspond to entities and edges to relationships among the entities. Any one of the existing conventional approaches is first used to determine possible matches among entities. Graph analysis techniques are then used to disambiguate among the various choices. While our approach is domain-independent, it can be tuned to specific domains by incorporating domain-specific rules. We demonstrate the applicability of our method to a large real dataset.
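
The two-stage idea described above (candidate matches first, graph analysis second) can be illustrated with a minimal sketch. This is not the paper's algorithm: the connection-strength measure used here (the number of short simple paths between a reference's context and each candidate) is an assumption chosen for simplicity, and the entity names, the helper functions build_graph, count_paths and disambiguate, and the max_len parameter are all hypothetical. The candidate list is assumed to come from any conventional attribute-similarity matcher.

```python
from collections import defaultdict


def build_graph(edges):
    """Build an undirected entity-relationship graph as an adjacency map."""
    graph = defaultdict(set)
    for u, v in edges:
        graph[u].add(v)
        graph[v].add(u)
    return graph


def count_paths(graph, source, target, max_len):
    """Count simple paths from source to target that use at most max_len edges."""
    def dfs(node, visited, remaining):
        if node == target:
            return 1
        if remaining == 0:
            return 0
        total = 0
        for nxt in graph.get(node, ()):
            if nxt not in visited:
                total += dfs(nxt, visited | {nxt}, remaining - 1)
        return total

    return dfs(source, {source}, max_len)


def disambiguate(graph, context, candidates, max_len=3):
    """Pick the candidate most strongly connected to the context node.

    Connection strength is approximated by a short-path count; this scoring
    rule is an illustrative assumption, not the measure used in the paper.
    """
    scores = {c: count_paths(graph, context, c, max_len) for c in candidates}
    best = max(scores, key=scores.get)
    return best, scores


# Hypothetical toy data: paper P1 lists an author "J. Smith", which a
# similarity-based matcher (assumed, not shown) maps to two candidates.
edges = [
    ("P1", "A. Lee"),          # A. Lee is a co-author of P1
    ("A. Lee", "UCI"),         # affiliation
    ("John Smith", "UCI"),     # affiliation
    ("A. Lee", "P2"),          # A. Lee also wrote P2 ...
    ("P2", "John Smith"),      # ... together with John Smith
    ("Jane Smith", "MIT"),     # affiliation
]
graph = build_graph(edges)
best, scores = disambiguate(graph, "P1", ["John Smith", "Jane Smith"])
print(best, scores)
```

In this toy graph, John Smith is reachable from P1 through a shared affiliation and a shared co-authorship, while Jane Smith is not connected at all, so the relationship evidence favors John Smith even though both names match "J. Smith" equally well on attributes alone. A domain-specific rule could be expressed, for example, by weighting edge types differently.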
