Organizations and companies archive many versions of digital data such as web pages, internal emails and so on. Such data is critical for internal investigation, regulatory compliance, and electronic discovery. It is estimated that electronic discovery market that leverages archival data will reach $9.9 billions globally in 2017. It is not uncommon for many businesses to retain archived collections for 10 to 15 years. How to archive these versioned data is worth to study and we are facing many challenges including 1) traditional index occupies too much space for versioned data, 2) traditional search is too slow on versioned data, and 3) how to guarantee high accuracy when improving efficiency in new architecture.
In this dissertation, we take the opportunity of the fast development of information retrieval and tackle the problem by proposing a new multi-version search architecture with cache-conscious ranking optimization framework. Specifically, we will first discuss our new versioned search architecture. Then, we will talk about a cache-conscious online ranking algorithm to improve the online part. Finally, we will describe a framework to select best blocking methods and parameters for our algorithm to achieve best performance.
Firstly, we present our new multi-version search architecture. We propose an approach that uses cluster-based retrieval to quickly narrow the search scope guided by version representatives at Phase 1 and develops a hybrid index structure with adaptive runtime data traversal to speed up Phase 2 search. The hybrid scheme exploits the advantages of forward index and inverted index based on the term characteristics to minimize the time in extracting positional and other feature information during runtime search. We compare several indexing and data traversal options with different time and space tradeoffs and describe evaluation results to demonstrate their effectiveness. The experiment results show that the proposed scheme can be up-to about 4x as fast as the previous work on solid state drives while retaining good relevance.
Secondly, we talk about our 2D blocking algorithm to optimize the online ranking part of the system. Multi-tree ensemble models have been proven to be effective for document ranking. Using a large number of trees can improve accuracy, but it takes time to calculate ranking scores of matched documents. We investigate data traversal methods for fast score calculation with a large ensemble and propose a 2D blocking scheme for better cache utilization with simpler code structure compared to previous work. The experiments with several benchmarks show significant acceleration in score calculation without loss of ranking accuracy.
Lastly, we describe a framework to fast select best blocking methods and parameters for our 2D blocking algorithm with the help of a full cache analysis. 2D blocking method is very helpful to improve online search efficiency. However, different traversal methods and blocking parameter settings can exhibit different cache and cost behavior depending on data and architectural characteristics. It is very time-consuming to conduct exhaustive search for performance comparison and optimum selection. We provide an analytic comparison of cache blocking methods on their data access performance for an approximation and propose a fast guided sampling scheme to select a traversal method and blocking parameters for effective use of memory hierarchy. The evaluation studies with three datasets show that within a reasonable amount of time, the proposed scheme can identify a highly competitive solution that significantly accelerates score calculation.
In summary, we have proposed a new multi-version search architecture with cache-conscious ranking optimization for the online search part and a framework to help fast select best blocking methods and parameters with full cache analysis for the 2D blocking method. By proposing this new versioned search system, we can meet challenges from scalability, efficiency and accuracy in multi-version search, and we believe this work would be useful to future researchers in this direction.