Due to the large size of many structured and semi-structured databases, queries often return a large number of relevant results. Result diversification has recently been proposed as an approach to increase user satisfaction in search engines and recommendation systems. The top-k returned results are not only relevant to the query but also as diverse as possible from each other.
In this dissertation, we address three diversification related problems and propose efficient solutions for them. We firstly consider diversification on semi-structured data. We show that diversity can occur not only in the document content but also (and more importantly) in the document structure. We present a novel algorithm for diversification that considers both the structure and the content of the matched results. We propose a distance measure that is an order of magnitude faster than the standard tree-edit distance. The second problem considers how to balance relevance and diversity in the final top-k returned results. Previous works balance relevance and diversity mostly in a predefined fixed way. We propose a principled method for adaptive diversification of query results that minimizes the user effort to find the desired results by dynamically balancing the relevance and diversity. Finally, we consider the distributed diversification problem on large result-sets dispersed over many nodes. Using the MapReduce framework, we consider two distinct approaches, one that focuses on optimizing disk I/O and one that optimizes for network I/O. Our methods are iterative in nature, allowing the user to continue refining the result if more time is available. Moreover, we prove that this iteration process converges while producing a 2-approximate diversified result when compared to the optimal solution.
Furthermore, in the last part of the thesis, we investigate the problem of answering top-k queries satisfying spatial constraints. We propose a novel index structure that uses R-tree to tackle the spatial constraints, and pre-computed inverted lists to answer the top-k queries efficiently using the well known threshold algorithm. We present a model than can estimate the expected size of the inverted lists for the threshold algorithm using the data properties and query parameters. With the proposed model, we could reduce the index size significantly without sacrificing the performance of the threshold algorithm.