Open Access Publications from the University of California

UC Santa Cruz

UC Santa Cruz Electronic Theses and Dissertations

Knowledge Extraction and Retrieval for Domain-Specific Documents


With the overwhelming amount of textual data created by a growing number of domain-based information systems, it has become a significant challenge to identify the precise “nugget” of relevant knowledge within a tremendous amount of noisy data, relevant and irrelevant alike, including techno-geek jargon and gobbledygook. When dealing with domain-specific text, many existing text mining methods fail to produce satisfying results because they are unable to handle complex domain languages, understand semantic meaning, model latent business processes, or leverage domain resources and expertise. This motivates us to develop novel, effective extraction models and analyses to identify desired information in domain-specific documents, as well as associated retrieval models and analyses. In this dissertation, we study this research topic in three different domains and approach the challenges of domain-specific text mining from multiple perspectives.

In an enterprise service center, accurate and timely delivery of knowledge to service representatives is the cornerstone of efficient customer service. There are two main steps in achieving this objective. The first step concerns efficient text mining to extract information of interest from the very long service request (SR) documents in the historical database. The second step concerns matching new service requests with previously solved service requests. Both steps improve efficiency by minimizing the time service personnel spend accessing knowledge. In this scenario we present our text analytics system, the Service Request Analyzer and Recommender (SRAR), which is designed to improve productivity in an enterprise service center for computer networking diagnostics and support. SRAR unifies a text preprocessor, a hierarchical classifier, and a service request recommender to deliver critical, pertinent, and categorized knowledge for improved service efficiency. The novel feature we report here is the identification of the components of the diagnostic process underlying the creation of the original text documents. This identification is crucial to the successful design and prototyping of SRAR and its hierarchical classifier elements. Equally, the use of domain knowledge and human expertise to generate features is an indispensable component in improving the accuracy of knowledge extraction and retrieval. The empirical evaluation demonstrates the effectiveness of our framework and algorithms. We observe significant improvements in service-time responsiveness during knowledge extraction and retrieval in the networking service center context at Cisco.
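The second step, matching a new service request against previously solved ones, can be illustrated with a minimal sketch. This is not SRAR's actual recommender, whose method is not described here; it assumes a plain bag-of-words cosine similarity, and the function names and the SR identifiers in the usage note are hypothetical.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two term-count vectors (Counters)."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def recommend(new_sr, solved, k=1):
    """Return the ids of the k solved SRs most similar to new_sr.

    `solved` maps a (hypothetical) SR id to its text. Whitespace
    tokenization is an assumption made to keep the sketch short.
    """
    query = Counter(new_sr.lower().split())
    scored = [
        (cosine(query, Counter(text.lower().split())), sr_id)
        for sr_id, text in solved.items()
    ]
    scored.sort(reverse=True)  # highest similarity first
    return [sr_id for _, sr_id in scored[:k]]
```

For instance, with `solved = {"SR-1": "router reboot loop after firmware upgrade", "SR-2": "vpn tunnel drops every hour"}`, the call `recommend("router stuck in reboot loop", solved)` ranks `SR-1` first because it shares three terms with the new request while `SR-2` shares none.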

In the healthcare domain, mentions of disorders in clinical notes provide crucial information on a patient’s physical or mental condition. However, the same disorder concept is documented in clinical text under many surface forms, some of which are recorded disjointedly, briefly, or intuitively. In this study, we propose a synergistic approach to extracting disorder concepts and their variants. We exploit rich features to predict mention spans using supervised learning algorithms, including support vector machines (SVMs). In addition to explicit bag-of-words, orthographic, and morphologic features, we investigate semantic, syntactic, and sequential features to better capture implicit relationships among words. In particular, the two types of semantic features we propose, both based on a medical ontology, prove very effective. We supplement the SVMs with a rule-based annotator and an unsupervised NLP system to improve the prediction accuracy and generalization capability of the system. Ultimately, this synergistic system produces state-of-the-art results on public challenge data sets.
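To make the feature types concrete, here is a minimal sketch of per-token bag-of-words, orthographic, morphologic, and sequential features of the kind that could be fed to a supervised span classifier. The feature names and the exact feature set are assumptions chosen for illustration; the ontology-based semantic features are not shown.

```python
def token_features(tokens, i):
    """Build a feature dict for tokens[i] (illustrative feature set)."""
    tok = tokens[i]
    return {
        # Bag-of-words feature: the normalized token itself.
        "word": tok.lower(),
        # Orthographic features: casing, digits, punctuation.
        "is_capitalized": tok[:1].isupper(),
        "is_all_caps": tok.isupper(),
        "has_digit": any(c.isdigit() for c in tok),
        "has_hyphen": "-" in tok,
        # Morphologic features: short prefixes and suffixes.
        "prefix3": tok[:3].lower(),
        "suffix3": tok[-3:].lower(),
        # Sequential features: neighbouring words, with boundary markers.
        "prev_word": tokens[i - 1].lower() if i > 0 else "<s>",
        "next_word": tokens[i + 1].lower() if i < len(tokens) - 1 else "</s>",
    }
```

Calling `token_features(["Patient", "denies", "COPD", "exacerbation"], 2)` yields a dict in which `is_all_caps` is true and `prev_word` is `"denies"`, the kind of cue that helps separate abbreviated disorder mentions from ordinary words.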

In the biomedical domain, we define the notion of a concept, extract all types of concepts from biomedical documents, and design a concept-based information retrieval framework. Using this framework, we transform documents and queries from term space into concept space, perform semantic analysis among concepts, and estimate a concept-based relevance model for improved document retrieval. Our approach has three advantages. First, it assumes independence only between concepts, so it preserves the strong dependencies between the words within a concept. Second, it unifies synonyms and different surface forms of a concept, which reduces the dimensionality of the space, increases the sample size of each concept, and consequently yields more accurate and reliable estimates of relevance. Third, when domain resources are available, our approach enables semantic analysis of query concepts and thus identifies concepts related to the query, from which a more accurate distribution of relevance can be estimated. We compare our approach with three benchmark retrieval models on different types of data collections. The proposed approach demonstrates consistent and statistically significant improvements across collections, outperforming top benchmark conceptual language models by at least 9% and up to 20% on a number of metrics.
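The transformation from term space into concept space can be sketched as follows. This is only an illustration under stated assumptions: the lexicon, the concept identifiers, and the greedy longest-match strategy are hypothetical, and a real system would draw surface forms from a biomedical domain resource instead of a hand-written dictionary.

```python
from collections import Counter

# Hypothetical lexicon mapping surface forms (token tuples) to concept ids.
LEXICON = {
    ("heart", "attack"): "C_MYOCARDIAL_INFARCTION",
    ("myocardial", "infarction"): "C_MYOCARDIAL_INFARCTION",
    ("aspirin",): "C_ASPIRIN",
}
MAX_LEN = max(len(form) for form in LEXICON)


def to_concepts(text):
    """Map text to a bag of concepts using greedy longest-match lookup."""
    tokens = text.lower().split()
    concepts = []
    i = 0
    while i < len(tokens):
        # Try the longest candidate phrase first so that multi-word
        # concepts are kept together as a single unit.
        for n in range(min(MAX_LEN, len(tokens) - i), 0, -1):
            form = tuple(tokens[i:i + n])
            if form in LEXICON:
                concepts.append(LEXICON[form])
                i += n
                break
        else:
            # No known concept starts here; this sketch simply skips the token.
            i += 1
    return Counter(concepts)
```

For example, `to_concepts("aspirin after heart attack or myocardial infarction")` counts `C_MYOCARDIAL_INFARCTION` twice and `C_ASPIRIN` once: the two synonymous phrases collapse into one dimension, and each multi-word phrase is treated as a unit instead of as independent words.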
