Discovering Information Integration Specifications From Data Examples
- Author(s): Qian, Kun
- Advisor(s): Kolaitis, Phokion G.
- et al.
Two fundamental problems in information integration are data exchange and entity resolution. Data exchange is the task of translating data structured under a source schema into data structured under a target schema. Data exchange is captured by schema mappings that specify the relationship between a source schema and a target schema at a high level. Entity resolution is the task of identifying and linking different representations of the same real-world object. The goal of entity resolution is to create links among existing data. Although schema mapping and entity resolution have been successfully used in many domains, manually designing schema mappings and entity resolution algorithms is a labor-intensive and time-consuming process.
In this dissertation, we develop example-driven discovery/learning methods for high-level declarative schema mapping specifications and high-level declarative entity resolution algorithms. This dissertation contains two parts. In Part I, we present our work on extending and refining two major example-driven schema-mapping discovery frameworks, namely, the repair framework introduced by Gottlob and Senellart and the learning framework introduced by ten Cate et al. Gottlob and Senellart introduced a framework for schema-mapping discovery from a single data example, in which the derivation of a schema mapping is cast as an optimization problem. We refine and study this framework in more depth. Among other results, we design a polynomial-time log(n)-approximation algorithm for computing optimal schema mappings from a given set of data examples for a restricted class of schema mappings; moreover, we show that this approximation ratio cannot be improved. We implemented the aforementioned log(n)-approximation algorithm and carried out an experimental evaluation in a real-world mapping scenario. As opposed to the repair framework, in which the schema-mapping discovery problem is cast as an optimization problem, the derivation of a schema mapping is cast as a computational learning problem in the learning framework. We design a learning algorithm that is an Occam algorithm leading up to a PAC learning algorithm for an important class of schema mappings. We also implemented the proposed algorithm and carried out an experimental evaluation using mapping scenarios created by iBench, which is a state-of-the-art benchmarking tool. In Part II, we introduce a new active learning system for entity resolution that learns high-quality entity resolution algorithms. Our focus is on learning entity resolution algorithms in big data scenarios. We implemented the aforementioned active learning system and carried out an experimental evaluation in two real-world big data entity resolution scenarios.