Integration, Provenance, and Temporal Queries for Large-Scale Knowledge Bases
- Author(s): Gao, Shi
- Advisor(s): Zaniolo, Carlo
Knowledge bases that summarize web information in RDF triples deliver many benefits, including support for natural language question answering and for powerful structured queries that extract encyclopedic knowledge via SPARQL. Large-scale knowledge bases grow rapidly in scale and significance, and undergo frequent changes in both schema and content. Two critical problems have thus emerged: (i) how to support temporal queries that explore the history of a knowledge base or flash back to its past states; (ii) how to integrate knowledge from different sources and improve the quality of the integrated knowledge base while preserving provenance information. In this dissertation, we propose a framework that supports knowledge integration, temporal query evaluation, and user-friendly interfaces for large-scale knowledge bases. Toward this goal, we make the following contributions:
(i) We propose SPARQLT, a temporal extension of the structured query language SPARQL. SPARQLT is based on a point temporal model, which simplifies the expression of temporal joins and eliminates the need for temporal coalescing. This approach enables HKB (Historical Knowledge Browser), an end-user interface in which users can browse the evolution history of knowledge bases and express historical queries via simple by-example conditions in the infoboxes of Wikipedia pages.
(ii) We have designed and implemented RDF-TX (RDF Temporal eXpress), an efficient system for managing temporal RDF data and evaluating SPARQLT queries. RDF-TX exploits compressed Multiversion B+ trees to evaluate temporal queries quickly. Experimental results demonstrate that our indexing and query optimization techniques deliver superior performance compared to other systems.
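The flavor of multiversion indexing can be conveyed with a much-simplified sketch (RDF-TX itself uses compressed Multiversion B+ trees; the class and method names below are illustrative assumptions): each key keeps its value history sorted by start time, and a flash-back lookup binary-searches for the entry in effect at the queried time point.

```python
import bisect

# Simplified multiversion index: key -> sorted history of (start_time, value).
# A snapshot ("flash-back") lookup binary-searches the history for the entry
# in effect at time t. Illustrative only; not the actual RDF-TX structures.

class MultiversionIndex:
    def __init__(self):
        self.times = {}   # key -> sorted list of version start times
        self.values = {}  # key -> values, parallel to self.times[key]

    def put(self, key, t, value):
        """Record that `key` takes `value` starting at time t."""
        ts = self.times.setdefault(key, [])
        vs = self.values.setdefault(key, [])
        i = bisect.bisect_right(ts, t)
        ts.insert(i, t)
        vs.insert(i, value)

    def get(self, key, t):
        """Return the value of `key` as of time t, or None if unset then."""
        ts = self.times.get(key, [])
        i = bisect.bisect_right(ts, t)
        return self.values[key][i - 1] if i else None

idx = MultiversionIndex()
idx.put(("Germany", "chancellor"), 1998, "Gerhard Schroeder")
idx.put(("Germany", "chancellor"), 2005, "Angela Merkel")
print(idx.get(("Germany", "chancellor"), 2003))  # -> Gerhard Schroeder
print(idx.get(("Germany", "chancellor"), 2010))  # -> Angela Merkel
print(idx.get(("Germany", "chancellor"), 1990))  # -> None
```

A real Multiversion B+ tree additionally keeps every historical snapshot accessible in logarithmic time while sharing pages between versions; the sketch above captures only the lookup-at-a-time-point behavior.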
(iii) We propose a framework for knowledge extraction and integration. We first introduce IBminer, a novel system that uses a deep NLP-based approach to extract subject-attribute-value triples from free text, preserving the provenance of the extracted triples and mapping the extracted attributes to those used in existing knowledge bases. We then integrate public knowledge bases with the knowledge base generated by IBminer into one of superior quality and coverage, called IKBStore. User-friendly interfaces are provided to manage the knowledge in IKBStore while maintaining its provenance information.
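The shape of the extraction output can be illustrated with a toy, pattern-based stand-in (IBminer itself relies on deep NLP; the patterns, attribute names, and source identifier below are illustrative assumptions): each extracted subject-attribute-value triple carries the provenance of the document it came from.

```python
import re

# Toy pattern-based extractor illustrating provenance-tagged triples.
# All patterns and names are illustrative; IBminer uses deep NLP instead.

PATTERNS = [
    (re.compile(r"^(?P<s>.+?) was born in (?P<v>.+?)\.$"), "birthPlace"),
    (re.compile(r"^(?P<s>.+?) is the capital of (?P<v>.+?)\.$"), "capitalOf"),
]

def extract_triples(text, source):
    """Extract (subject, attribute, value, provenance) tuples from text."""
    triples = []
    for sentence in text.split("\n"):
        for pattern, attribute in PATTERNS:
            m = pattern.match(sentence.strip())
            if m:
                triples.append((m.group("s"), attribute, m.group("v"), source))
    return triples

doc = "Ada Lovelace was born in London.\nSacramento is the capital of California."
for triple in extract_triples(doc, source="doc:example-1"):
    print(triple)
# -> ('Ada Lovelace', 'birthPlace', 'London', 'doc:example-1')
# -> ('Sacramento', 'capitalOf', 'California', 'doc:example-1')
```

Keeping the source identifier alongside each triple is what lets a downstream integrated store such as IKBStore trace every fact back to the document it was derived from when conflicting values must be reconciled.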