 Open Access Publications from the University of California

## Models and Algorithms for Crowdsourcing Discovery

• Author(s): Faridani, Siamak
Crowdsourcing can also be used to collect data on human motor movement. Fitts' law is a classical model for predicting the average time of a human motor movement. It has traditionally been used in human-computer interaction (HCI) to explain the time needed to move a pointing device from an origin to a target, expressed as a logarithmic function of the target width ($W$) and the distance from the pointer to the target ($A$). In the next project we first present a square-root variant of Fitts' law similar to that of Meyer et al. \cite{meyer1988optimality}. To evaluate this model we performed two studies: one uncontrolled crowdsourced study and one controlled in-lab study with 46 participants. We show that the data collected in the crowdsourced experiment closely follow the results of the in-lab experiment. For Homogeneous Targets, the Square-Root model ($T = a + b\sqrt{\frac{A}{W}}$) results in a smaller ERMS error than the two control models, LOG ($T = a + b\log\frac{2A}{W}$) and LOG' ($T = a + b\log\left(\frac{A}{W}+1\right)$), for $A/W<10$. Similarly, for Heterogeneous Targets the Square-Root model results in a significantly smaller ERMS error than the LOG model for $A/W<10$. The LOG model resulted in a significantly smaller ERMS error for $A/W>15$. For Heterogeneous Targets, the LOG' model consistently resulted in a significantly smaller error for $0 \ldots$

Opinion Space is a system that directly elicits opinions from participants for idea generation. It collects both numerical and textual data, and we examine methods for combining the two. Canonical Correlation Analysis (CCA) is used to combine the textual and numerical inputs from participants. CCA seeks linear transformation matrices that maximize the correlation between the lower-dimensional projections of the numerical ratings ($Xw_x$) and of the textual comments ($Yw_y$) onto the two-dimensional space.
In other words, it seeks to solve the problem $\arg\max_{w_x,w_y} \mathrm{corr}(Xw_x, Yw_y)$, in which $X$ and $Y$ are the high-dimensional representations of the participants' numerical ratings and textual comments, and $Xw_x$ and $Yw_y$ are their lower-dimensional representations. Using participants' numerical feedback on each other's comments, we then develop an evaluation framework for comparing dimensionality reduction methods. In this framework, a dimensionality reduction is the most appropriate for Opinion Space when $\gamma = -\mathrm{corr}(R,D)$ takes its largest value. Here $R$ is the set of $r_{ij}$ values, where $r_{ij}$ is the rating that participant $i$ gives to the textual opinion of participant $j$. Similarly, $D$ is the set of $d_{ij}$ values, where $d_{ij}$ is the Euclidean distance between the locations of participants $i$ and $j$. In this dissertation we argue why this evaluation framework is appropriate for Opinion Space. We compared different variations of CCA and PCA dimensionality reduction on different datasets. Our results suggest that the $\gamma$ values for CCA are at least $169\%$ larger than those for PCA, making CCA a more appropriate dimensionality reduction model for Opinion Space.

A product review on an online retailer's website is often accompanied by numerical ratings of the product on different scales, a textual review, and sometimes information on whether the review was helpful. Generalized Sentiment Analysis examines the correlation between the textual comment and the numerical ratings and uses it to infer the ratings on the different scales from the textual comment. We provide the formulations for using CCA to solve this problem. We compare our CCA model with Support Vector Machines (SVM), Linear Regression, and other traditional machine learning models and highlight its strengths and weaknesses.
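
The evaluation score $\gamma = -\mathrm{corr}(R,D)$ defined above for Opinion Space can be illustrated with a minimal sketch. It assumes that the correlation is the Pearson correlation and that each participant already has a two-dimensional location from some dimensionality reduction. The function names and the data layout (`positions`, `ratings`) are hypothetical and are not taken from the dissertation.

```python
from math import dist, sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


def gamma_score(positions, ratings):
    """Return -corr(R, D) for one low-dimensional embedding.

    positions: {participant: (x, y)}, the embedded location (assumed layout)
    ratings:   {(i, j): r_ij}, the rating participant i gave to j's comment
    """
    r_values = []
    d_values = []
    for (i, j), r_ij in ratings.items():
        r_values.append(r_ij)
        # d_ij is the Euclidean distance between the two participants.
        d_values.append(dist(positions[i], positions[j]))
    return -pearson(r_values, d_values)


if __name__ == "__main__":
    # Toy data: nearby participants rate each other's comments higher.
    positions = {"a": (0.0, 0.0), "b": (1.0, 0.0), "c": (5.0, 0.0)}
    ratings = {("a", "b"): 0.9, ("b", "a"): 0.8, ("a", "c"): 0.1, ("b", "c"): 0.3}
    print(gamma_score(positions, ratings))
```

Under this reading, an embedding scores well when participants who rate each other's comments highly are placed close together, so that ratings and distances are negatively correlated and $\gamma$ is positive.
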
We found that training the CCA formulation is significantly faster than training an SVM, which is traditionally used in this context (the fastest SVM training time in LibSVM was 1,126 seconds, while CCA took only 33 seconds). We also observed that the Mean Squared Error (MSE) for CCA was smaller than that of other competing models (the MSE for CCA with tf-idf features was 1.69, while for SVM it was 2.28). Linear regression was more sensitive to the featurization method: it resulted in a larger MSE when used with multinomial ($MSE = 8.88$) and Bernoulli features ($MSE = 4.21$) but a smaller MSE when tf-idf weights were used ($MSE = 1.47$).
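
The three featurization methods named above differ only in how a term's occurrences in a review are turned into a number. The following is a minimal sketch, not the dissertation's pipeline: it assumes whitespace tokenization, and it assumes the common tf-idf weighting of raw term count times $\log(N/\mathrm{df})$, since the abstract does not state which variant was used. The function name `featurize` is hypothetical.

```python
from collections import Counter
from math import log


def featurize(docs, scheme):
    """Return (vocabulary, feature rows) for a list of text documents.

    scheme: "multinomial" (term counts), "bernoulli" (term presence),
            or "tfidf" (count * log(N / document frequency), an assumed variant).
    """
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(tok for toks in tokenized for tok in set(toks))

    rows = []
    for toks in tokenized:
        counts = Counter(toks)
        if scheme == "multinomial":
            row = [float(counts[t]) for t in vocab]
        elif scheme == "bernoulli":
            row = [1.0 if counts[t] else 0.0 for t in vocab]
        elif scheme == "tfidf":
            row = [counts[t] * log(n_docs / doc_freq[t]) for t in vocab]
        else:
            raise ValueError(f"unknown scheme: {scheme}")
        rows.append(row)
    return vocab, rows


if __name__ == "__main__":
    reviews = ["great great phone", "bad phone"]
    for scheme in ("multinomial", "bernoulli", "tfidf"):
        vocab, rows = featurize(reviews, scheme)
        print(scheme, vocab, rows)
```

In this toy example the word "phone" appears in every review, so its tf-idf weight is zero, while the count and presence features still give it weight. Differences of this kind are one plausible reason a linear model can behave differently across featurizations.
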