User-generated content on the Internet has grown explosively in the current Web 2.0 era. This growth has been facilitated by widespread user access to the web through mobile devices, the rapid rise of social media applications, and review-based provider websites. The majority of this data is in the form of free text, such as social media posts. Storing and querying this massive volume of unstructured textual data is a challenging task that has been studied extensively in recent years.
Current search solutions, such as Google, Bing, and Amazon’s internal search, are effective in allowing users to find relevant documents in large collections. These solutions rely on several content-based and reputation-based factors, including the relevance of a document to the user query. However, capturing and exploiting user intent, particularly in a domain-specific setting, remains an open problem with a variety of research challenges. In this thesis, we study several such settings where existing search techniques are inadequate.
In particular, we study the following subproblems, each of which showcases the benefit of leveraging domain-specific knowledge and user-generated content: 1) We argue for more effective item ranking on crowd-sourced review platforms and provide efficient algorithms to support it. 2) We provide a practical, high-quality solution for building domain-specific ontologies from unstructured text documents. We describe our approach and provide fast and simple algorithms that use the generated ontology to extract domain-specific features from the textual data. We illustrate the approach with a real-estate agency case study in which domain agents are interested in evaluating textual property descriptions. 3) We study how to search for similar documents, given a set of input documents, when the data source can only be accessed through a query interface (such as Google search). We propose a ranking model that extracts effective query keywords from the input documents in order to retrieve similar documents through keyword-based search APIs. 4) We use data mining techniques to classify user-generated content on online forums in terms of its characteristics, such as bullying behavior. Specifically, we crawl Yik Yak, an anonymous social media application, to detect potentially harmful behaviors.