Visual Learning with Weak Supervision: Applications in Video Summarization and Person Re-Identification
- Author(s): Panda, Rameswar
- Advisor(s): Roy-Chowdhury, Amit K, et al.
Many of the recent successes in computer vision have been driven by the availability of large quantities of labeled training data. However, in the vast majority of real-world settings, collecting such data sets by hand is infeasible due to the cost of labeling data or the paucity of data in a given domain. One increasingly popular approach is to use weaker forms of supervision that are potentially less precise but can be substantially less costly than producing explicit annotations for the given task. Examples include domain knowledge, weakly labeled data from the web, constraints arising from the physics of the problem or from intuition, noisy labels from distant supervision, unreliable annotations obtained from crowd workers, and transfer learning settings. In this thesis, we explore two important and highly challenging problems in computer vision, namely video summarization and person re-identification, where learning with weak supervision could be extremely useful but remains largely under-addressed in the literature.
One common assumption of many existing video summarization methods is that videos are independent of each other, and hence the summarization tasks are conducted separately, neglecting relationships that may exist across the videos. In the first approach, we investigate how topic-related videos can provide additional knowledge and useful clues for extracting a summary from a given video. We develop a sparse optimization framework for finding a set of representative and diverse shots that simultaneously capture both the important particularities arising in the given video and the generalities identified from the set of topic-related videos. In the second approach, we present a novel multi-view video summarization framework that exploits the data correlations through an embedding, without assuming any prior correspondences or alignment between the multi-view videos, as is the case in uncalibrated camera networks. Via extensive experimentation on different benchmark datasets, we validate both of our approaches and demonstrate that our frameworks are able to extract higher-quality video summaries than the state-of-the-art alternatives.
Most work in person re-identification has focused on a fixed network of cameras. However, in practice, new camera(s) may be added, either permanently or on a temporary basis. In the final part of the dissertation, we show that it is possible to on-board new camera(s) to an existing network using domain adaptation techniques with limited additional supervision. We develop a domain-perceptive re-identification framework that can effectively discover and transfer knowledge from the best source camera (already installed) to newly introduced target camera(s), without requiring a very expensive training phase. Our approach can greatly increase the flexibility and reduce the deployment cost of new cameras in many real-world dynamic camera networks.