Visual Learning with Weak Supervision
- Author(s): CICEK, BAYRAM SAFA
- Advisor(s): SOATTO, STEFANO
- et al.
We focus on two broad learning setups. The first is classic semi-supervised learning (SSL), wherein a few labeled samples and many unlabeled samples are drawn from the same distribution. The second is a more challenging setting called unsupervised domain adaptation (UDA), where labeled and unlabeled samples are drawn from slightly different distributions. This setting is more practical because the labeled data can be synthetic, produced abundantly using graphics engines at the expense of a domain shift from the real data.
Our contributions are multi-faceted. For the SSL setup, we show that training speed in the supervised setting correlates strongly with the percentage of correct labels. Since training speed, unlike the accuracy of the estimates, is measurable at training time, we propose to use it as an inference criterion. We additionally show that robustness to small perturbations in input space and robustness to small perturbations in weight space do not imply each other, and that both improve generalization performance. We then propose a method that combines input-space smoothing with weight-space smoothing.
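As a toy illustration of why training speed can signal label quality, the sketch below fits a one-dimensional logistic regression for a fixed number of gradient steps, once on correct labels and once on labels where every other sample is flipped. It is a minimal sketch under stated assumptions, not the method developed in the thesis: the data, the model, the step count, the learning rate, and the helper name `loss_after_training` are all hypothetical choices made for illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def loss_after_training(xs, ys, steps=50, lr=1.0):
    """Run a fixed budget of gradient descent and return the final mean
    cross-entropy. A lower value means the labels were fitted faster."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        grad_w = grad_b = 0.0
        for x, y in zip(xs, ys):
            p = sigmoid(w * x + b)
            grad_w += (p - y) * x
            grad_b += p - y
        w -= lr * grad_w / n
        b -= lr * grad_b / n
    total = 0.0
    for x, y in zip(xs, ys):
        p = sigmoid(w * x + b)
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total / n


# Hypothetical toy data: 40 evenly spaced points in (-1, 1).
xs = [-1.0 + 2.0 * (i + 0.5) / 40 for i in range(40)]
clean_labels = [1 if x > 0 else 0 for x in xs]
# Assumed corruption: flip every other label, so half the labels are wrong.
noisy_labels = [1 - y if i % 2 == 0 else y for i, y in enumerate(clean_labels)]

clean_loss = loss_after_training(xs, clean_labels)
noisy_loss = loss_after_training(xs, noisy_labels)
print(f"final loss, correct labels: {clean_loss:.3f}")
print(f"final loss, half-flipped labels: {noisy_loss:.3f}")
```

With the same training budget, the loss on the correct labels falls well below the loss on the corrupted labels, which barely moves from its starting value. Both quantities are computable without knowing which labels are actually correct.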
In the UDA setup, we begin by proposing a method that trains a shared embedding to align the distributions of the learned features conditioned on the classes. To obtain more interpretable models, we also explore generative modeling, generating images in the unlabeled target domain in a manner that allows independent control of class and nuisance variability. We also study UDA for dense prediction tasks such as semantic segmentation, where manual annotation is more costly. Lastly, we examine monocular depth prediction, where the goal is to infer dense depth maps from images. Recovering the 3D scene from a single image is an ill-posed problem, as infinitely many 3D scenes are compatible with a given image. We therefore also leverage LiDAR measurements, which bring the additional challenge of processing sparse inputs. We propose to use only sparse depth as input, not images, so that the method is not affected by covariate shift, and we use the image to refine the predicted depth map.
Finally, we study the effect of adversarial perturbations on networks trained in an unsupervised fashion, particularly for the task of monocular depth prediction. We explore the ability of small, imperceptible additive perturbations to selectively alter the perceived geometry of the scene. Our work helps to understand corner cases and failure modes in order to develop more robust representations. This, in turn, will improve interpretability (or, rather, reduce nonsensical behavior).
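The idea of a small additive perturbation that steers a prediction toward a chosen value can be sketched on a toy model. The example below is a minimal sketch under stated assumptions: a linear function stands in for a depth network, and the attack is a single sign-gradient step (in the style of the fast gradient sign method), which is an illustrative choice and not necessarily the procedure used in the thesis. The names `predict` and `targeted_perturbation`, the weights, and the pixel values are all hypothetical.

```python
def predict(weights, pixels):
    """Toy stand-in for a depth predictor: a linear function of the pixels."""
    return sum(w * p for w, p in zip(weights, pixels))


def sign(v):
    return (v > 0) - (v < 0)


def targeted_perturbation(weights, pixels, target, eps):
    """One sign-gradient step on the loss 0.5 * (f(x) - target)^2.

    For the linear model the gradient with respect to pixel i is
    (f(x) - target) * w_i, so each pixel moves by at most eps in the
    direction that reduces the loss.
    """
    err = predict(weights, pixels) - target
    return [-eps * sign(err * w) for w in weights]


# Hypothetical values chosen only for illustration.
weights = [0.5, -1.0, 2.0, 0.25]
pixels = [0.2, 0.4, 0.6, 0.8]
target = 2.0   # the prediction the attacker wants to induce
eps = 0.01     # per-pixel perturbation budget

delta = targeted_perturbation(weights, pixels, target, eps)
adversarial = [p + d for p, d in zip(pixels, delta)]

before = predict(weights, pixels)
after = predict(weights, adversarial)
print(f"prediction before: {before:.4f}, after: {after:.4f}, target: {target}")
```

Each pixel changes by at most `eps`, yet the prediction moves toward the target by `eps` times the sum of the absolute weights. For a real depth network the gradient would come from automatic differentiation, and the step would typically be repeated.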