Theoretical Analysis and Efficient Algorithms for Crowdsourcing
- Author(s): Li, Hongwei
- Advisor(s): Yu, Bin
- et al.
Crowdsourcing has become an effective and popular tool for human-powered computation to label large datasets. Since the workers can be unreliable, it is common in crowdsourcing to assign multiple workers to one task, and to aggregate the labels in order to obtain results of high quality. This thesis is dedicated to analyze the process of crowdsourced labeling and propose algorithms to improve the efficiency of this process.
In Chapter 1, we give an overview of crowdsourcing and its applications. Meanwhile, we address the challenges of labeling by crowdsourcing and present the organization of the thesis.
In Chapter 2, we provide finite-sample exponential bounds on the error rate (in probability and in expectation) of general aggregation rules under the Dawid-Skene crowdsourcing model. The bounds are derived for multi-class labeling, and can be used to analyze many aggregation methods, including majority voting, weighted majority voting and the oracle Maximum A Posteriori (MAP) rule. We show that the oracle MAP rule approximately optimizes the upper bound on the mean error rate of weighted majority voting in certain setting.
In Chapter 3, we propose an iterative weighted majority voting (IWMV) method that optimizes the error rate bound and approximates the oracle MAP rule. Its one step version has a provable theoretical guarantee of the error rate. The IWMV method is intuitive and computationally simple. Compared to the traditional EM approach, IWMV is more robust to the noise of estimating workers' reliabilities and model misspecification. Experimental results on both simulation and real data show that IWMV performs at least on par with the state-of-the-art methods, and it has low computation cost (around one hundred times faster than the state-of-the-art methods).
In Chapter 4, we study the problem of selecting subsets of workers from a given worker pool to maximize the accuracy under a budget constraint. One natural question is whether we should hire as many workers as the budget allows, or restrict on a small number of top-quality workers. By theoretically analyzing the error rate of a typical setting in crowdsourcing, we frame the worker selection problem into a combinatorial optimization problem and propose an algorithm to solve it efficiently. Empirical results on both simulated and real-world datasets show that our algorithm is able to select a small number of high-quality workers, and performs as good as, sometimes even better than, the much larger crowds as the budget allows.
In practice, worker quality (for some tasks) can be associated with some explicit characteristics of the worker, such as education level, major and age; although it is also possible that in some other tasks none of the attributes matter. In Chapter 5, we propose a general crowd targeting framework that can automatically discover, for a given task, if any group of workers based on their attributes have higher quality on average, and target these groups, if they exist, for future work of the same task. This framework is complementary to traditional worker quality estimation approaches and is more budget efficient because we are able to target potentially good workers before they actually do the task. Experiments on real datasets show that the accuracy of final prediction can be improved significantly for the same budget or even less budget in some cases. Our framework is very practical and can be easily integrated in current crowdsourcing platforms.