In modern statistical applications, we are often faced with situations
where there is either too little or too much data. Both extremes can
be troublesome: Interesting models can only be learnt when sufficient
amounts of data are available, yet these models tend to become
intractable when data is abundant. An important thread of research
addresses these difficulties by subsampling the data prior to learning
a model. Subsampling can be active (i.e. active learning) or
randomized. While both techniques have a long history, applying them
directly to novel situations is often problematic. This dissertation
addresses some of these issues.
We begin with an active learning strategy for spectral clustering when
the cost of assessing individual similarities is substantial or
prohibitive. We give an active spectral clustering algorithm that
iteratively selects which similarities to query based on information
gleaned from a partial clustering, and which improves over common
alternatives.
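To make the querying loop concrete, the sketch below illustrates one way to interleave partial spectral clusterings with queries for the most ambiguous similarities. It is only a rough illustration under assumed interfaces (the oracle query_similarity and all parameter names are hypothetical), not the specific algorithm developed in the corresponding chapter.

    import numpy as np
    from sklearn.cluster import KMeans

    def active_spectral_clustering(query_similarity, n, k, budget, n_seed=10, seed=0):
        """Sketch of active similarity querying for spectral clustering.

        query_similarity(i, j) -> float is the (assumed expensive) oracle.
        Unqueried entries are imputed with the mean of the observed entries;
        each round re-clusters and queries a similarity involving the point
        whose spectral embedding is most ambiguous (smallest margin between
        its two nearest cluster centroids).
        """
        rng = np.random.default_rng(seed)
        W = np.full((n, n), np.nan)
        np.fill_diagonal(W, 0.0)
        for _ in range(n_seed):                        # seed with a few random queries
            i, j = rng.choice(n, size=2, replace=False)
            W[i, j] = W[j, i] = query_similarity(i, j)

        spent = n_seed
        while True:
            S = np.where(np.isnan(W), np.nanmean(W), W)   # impute unknown entries
            L = np.diag(S.sum(axis=1)) - S                # unnormalized graph Laplacian
            _, vecs = np.linalg.eigh(L)
            emb = vecs[:, :k]                             # spectral embedding
            km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(emb)
            if spent >= budget:
                return km.labels_
            dist = np.linalg.norm(emb[:, None, :] - km.cluster_centers_[None], axis=2)
            dist.sort(axis=1)
            margin = dist[:, 1] - dist[:, 0]              # small margin = ambiguous point
            for i in np.argsort(margin):
                unknown = np.flatnonzero(np.isnan(W[i]))
                if unknown.size:                          # query one new similarity
                    j = rng.choice(unknown)
                    W[i, j] = W[j, i] = query_similarity(i, j)
                    spent += 1
                    break
            else:
                return km.labels_                         # nothing left to query
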
Next, we consider active learning in Bayesian models. Complex Bayesian
models often require an MCMC-based method for inference, which makes a
naive application of common active learning strategies
intractable. We propose an approximate active learning method which
reuses samples from an existing MCMC chain in order to speed up the
computations.
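As a rough illustration of the sample-reuse idea (not the estimator derived in the corresponding chapter; log_lik, candidates, and outcomes are hypothetical names), one can importance-reweight the existing chain under each hypothetical outcome and score candidate queries by the approximate information gained about the parameters:

    import numpy as np

    def score_candidates(theta_samples, log_lik, candidates, outcomes):
        """Sketch: approximate expected information gain per candidate query
        by reusing samples from an existing MCMC chain.

        theta_samples : existing posterior samples (S of them).
        log_lik(y, x, theta_samples) -> length-S array of log p(y | x, theta_s).
        candidates    : candidate query points x.
        outcomes      : finite set of possible responses y (e.g. [0, 1]).

        Observing y at x would reweight sample s by w_s ~ p(y | x, theta_s);
        KL(w || uniform) then approximates the information gained about theta,
        without rerunning MCMC for every hypothetical observation.
        """
        S = len(theta_samples)
        scores = []
        for x in candidates:
            ll = np.array([log_lik(y, x, theta_samples) for y in outcomes])  # (|Y|, S)
            p_y = np.exp(ll).mean(axis=1)
            p_y /= p_y.sum()                     # predictive probabilities of outcomes
            gain = 0.0
            for yi in range(len(outcomes)):
                w = np.exp(ll[yi] - ll[yi].max())
                w /= w.sum()                     # importance weights for reused samples
                gain += p_y[yi] * (np.log(S) + np.sum(w * np.log(w + 1e-12)))
            scores.append(gain)
        return np.array(scores)                  # query the candidate with highest score
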
Our third contribution looks at the effects of randomized subsampling
on Gaussian process models that make predictions about outliers and
rare events. Randomized subsampling risks making outliers even
rarer, which, in the context of Gaussian process models, can lead to
overfitting. We show that heavy-tailed stochastic processes can be
used to improve robustness of regression and classification estimators
to such outliers by selectively shrinking them more strongly in sparse
regions than in dense regions.
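One standard construction of a heavy-tailed process, sketched below purely for illustration (the function name and parameters are hypothetical, and this need not match the construction studied here), passes a latent Gaussian process through Student-t marginals via a Gaussian copula, keeping the GP's dependence structure while placing more prior mass on large deviations:

    import numpy as np
    from scipy import stats

    def sample_heavy_tailed_process(X, lengthscale=1.0, df=3.0, n_draws=5, seed=0):
        """Sketch: draw from a heavy-tailed process prior built by pushing a
        Gaussian process through a Student-t marginal transform
        (a Gaussian copula with heavy-tailed marginals).

        X : (n, d) array of input locations.
        """
        rng = np.random.default_rng(seed)
        # squared-exponential covariance for the latent GP
        sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=2)
        K = np.exp(-0.5 * sq / lengthscale ** 2) + 1e-8 * np.eye(len(X))
        chol = np.linalg.cholesky(K)
        draws = []
        for _ in range(n_draws):
            g = chol @ rng.standard_normal(len(X))   # latent GP draw, N(0, K)
            u = stats.norm.cdf(g)                    # map to uniforms (copula step)
            draws.append(stats.t.ppf(u, df=df))      # heavy-tailed Student-t marginals
        return np.array(draws)
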
Finally, we turn to a theoretical evaluation of randomized subsampling
for the purpose of inferring rankings of objects. We present two
simple algorithms that predict a total order over n objects from a
randomized subsample of binary comparisons. In expectation, the
algorithms match an Ω(n) lower bound on the sample complexity
for predicting a permutation with fixed expected Kendall tau
distance. Furthermore, we show that given O(n log n) samples, one
algorithm recovers the true ranking with uniform quality, while the
other predicts the ranking more accurately near the top than the
bottom. Due to their simple form, the algorithms can be easily
extended to online and distributed settings.
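For intuition, a minimal counting-based estimator in the spirit of such algorithms (though not necessarily identical to the two analyzed here; compare and all names below are hypothetical) ranks objects by their empirical win rate over randomly sampled comparisons, and is trivially updated online or aggregated across machines:

    import numpy as np

    def rank_from_random_comparisons(n, compare, n_samples, seed=0):
        """Sketch: estimate a total order over n objects from randomly
        subsampled binary comparisons by sorting on empirical win rates.

        compare(i, j) -> True if object i beats object j (possibly noisy).
        """
        rng = np.random.default_rng(seed)
        wins = np.zeros(n)
        counts = np.zeros(n)
        for _ in range(n_samples):
            i, j = rng.choice(n, size=2, replace=False)
            if compare(i, j):
                wins[i] += 1
            else:
                wins[j] += 1
            counts[i] += 1
            counts[j] += 1
        rate = np.divide(wins, counts, out=np.zeros(n), where=counts > 0)
        return np.argsort(-rate)       # predicted ranking, best object first

    # Example: noisy-free comparisons consistent with the order 0 < 1 < ... < n-1
    # predicted = rank_from_random_comparisons(100, lambda i, j: i > j, 5000)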