Training and Evaluating Visual Recognition Systems with Limited Annotations
- Author(s): Nguyen, Phuc Xuan
- Advisor(s): Fowlkes, Charless; Ramanan, Deva; et al.
In recent years, large-scale datasets with high-quality annotations have enabled many significant discoveries in computer vision and machine learning. However, large-scale human annotation is not always affordable. For example, semantic image segmentation requires a label for every pixel in an image. Similarly, action localization in untrimmed videos requires precise temporal boundaries for each action instance. The annotation effort for these tasks is prohibitively expensive to scale. What can we do in such scenarios? In this thesis, we examine several ways to circumvent this challenge. First, we study micro-videos, a new and abundant source of visual data whose user-generated metadata can serve as web supervision. Second, we explore weak supervision in the context of action localization: while precise boundary annotations of action instances are expensive and difficult to obtain, video-level labels are cheaper and more accessible. We present two state-of-the-art systems that leverage this weaker form of supervision and compare them to their fully supervised counterparts. Lastly, we address the problem of accurately measuring the performance of computer vision systems with limited human annotations.