Query Answering in Data Integration Systems
This dissertation focuses on several important problems in query answering over a single or multi-source database. The first part of this work focuses on the challenges faced in querying a distributed set of data sources. An Online Answering Systems for Integrated Sources (OASIS) is presented which considers source coverage, overlap, and cost to order source accesses such that answers are returned as soon as possible. The first component of OASIS is a fast and scalable method for estimating source overlaps. The second component utilizes overlap estimates and two ordering strategies (static and dynamic) to order source accesses. The last component applies several heuristics to select additional overlap statistics to compute with the goal of obtaining a better source ordering. The efficiency and effectiveness of the proposed system are demonstrated through an extensive experimental evaluation. While the first part of the dissertation focuses on multi-source query answering, the second part of the dissertation focuses on single-source query answering. First, a candidate document ordering strategy is presented followed by a query processing approach for XML documents and queries. The first component considers a candidate document ordering strategy to minimize the expected time to the first k matches. The optimization problem is considered for applications which contain document precedence constraints which restrict the order in which candidate documents must be processed. The second component presents a unified method for solving three important problems in XML structural matching, namely, Filtering, Query Processing, and Tuple-Extraction. The queries and XML documents are represented using a sequential encoding, referred to as Node Encoded Tree Sequences (NETS). The unified solution for the three problems is composed of two procedures, subsequence matching and structural matching, which can be executed concurrently or sequentially depending on the problem. The solution for subsequence matching is based on the dynamic programming recurrence relation for the Longest Common Subsequence (LCS) problem. For structural matching, a new necessary and sufficient condition is presented which provides a simple verification procedure. In addition to using a unified framework, (for easier implementation and maintenance), experimental results show that the proposed algorithms outperform state of the art approaches for the three XML processing problems.