Efficient crowdsourcing algorithms for discovering the top items
Crowdsourcing provides a convenient way to collect information from humans. It is proved to be an effective approach to solve large scale problems with relatively low cost. Discovering top items from large datasets is a fundamental problem, which finds its applications in various areas, such as players ranking, candidates recruiting, students admission, and so on. Crowdsourcing can provide cost-efficient solutions for these tasks, which usually require a large number of inputs.
We start the research by exploring crowdsourcing data for ranking. Two common data types, numerical grades and pairwise comparisons, are examined. We design grading and comparison tasks with the same budgets and run them in crowdsourcing platform. After collecting crowd workers’ responses, we analyze the influence factors and compare the precisions of two types of data on ranking tasks. We choose to use pairwise comparisons in the rest of this dissertation based on the analysis results.
Generally, there are two scenarios in the problem of top item discovery. In one scenario, the number of target top items is unknown and unidentifiable, meanwhile the qualities of top items are more important than the bottom ones, i.e., the errors in top items incur larger losses. In this circumstance, the problem of identifying top items is essentially a top-heavy ranking problem. In another scenario, the number of target top items can be predefined or known. Such scenario occurs frequently when there is a constraint on the amount of resources available. The problem is equivalent to a top-K selection problem in this circumstance. In this dissertation, we propose novel algorithms to solve problems of both scenarios.
For top-heavy ranking problems, as there is extensive research on aggregating pairwise comparisons into rankings, we focus on studying strategies for selecting pairs of items to be sent to the crowd, in order to make rankings converge as quickly as possible to the correct results. We define a “top-heavy” version of ranking distance, and propose two algorithms to select the next pairs of items for the crowd to compare. In particular, we define the loss involved in each pair of items, and design strategies which select pairs that have maximum loss, or that lead to maximum ranking change. We demonstrate that the rankings converge significantly faster with our algorithms in the experiments. We further extend our algorithm to an efficient batched version, which delivers the comparable ranking results, while dramatically reduces the computation time and complexity.
For top-K selection problems, we propose an online top-K selection algorithm, as existing approaches are either inefficient as a result of having to learn a full ranking which incurs unnecessary crowdsourcing costs, or require a fixed amount of work for task completion and can not provide useful partial results if less work is performed. Our algorithm dynamically and repeatedly selects items into acceptance (in the top-K) and rejection (not in top-K) with an error tolerance ϵ>0. Partial results can be obtained at any point of time in the algorithm. We prove that the complexity of the algorithm is O(N logN), given an error bound and average probability of correctness of crowd. We further propose several optimization techniques to reduce the number of comparisons needed. Experimental results show that the algorithm can achieve comparable results with significantly fewer comparisons.