Sharing information across object templates
- Author(s): Zhu, Xiangxin
- Advisor(s): Ramanan, Deva
Object detection is a central and challenging task in computer vision. In this thesis, we first examine the "big data" hypothesis: that object detection might be solved with simple models backed by massive training data. We show empirically that the performance of a state-of-the-art method (discriminatively trained HOG templates) saturates quickly as more data is added; the training set may need to grow exponentially to produce a fixed improvement in accuracy. We also find that the key difficulty in detection is large variation in object appearance and, more importantly, that this variation exhibits a "long-tail" distribution: there are many rare cases with little training data, which makes them hard to model.

This thesis addresses these challenges by proposing new representations that share information within and across object subcategories. Sharing allows one to learn models for rare subcategories in the long tail, where traditional approaches suffer from a lack of training data. We investigate two methods for sharing. The first is global: entire training examples are shared across multiple subcategories; for example, an SUV image might be used to train both a car and a truck subcategory model. The second is local: subwindows of training examples are shared through "parts"; for example, nearly all vehicles contain wheels. By mixing and matching (or composing) different parts, one can implicitly encode an exponentially large set of subcategory models, including subcategories never encountered in the training data.
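The combinatorial benefit of part composition can be illustrated with a small counting sketch. The part names and mixture counts below are hypothetical, chosen only to show how the number of implicitly encoded templates grows multiplicatively with the part vocabulary:

```python
from math import prod

# Hypothetical vocabulary: each part type has a few appearance mixtures.
# Choosing one mixture per part defines a distinct composed template,
# so the number of encodable templates is the product of mixture counts,
# while the number of local models actually trained is only their sum.
mixtures_per_part = {"wheel": 4, "window": 3, "body": 5, "bumper": 2}

num_templates = prod(mixtures_per_part.values())   # 4 * 3 * 5 * 2 = 120
num_local_models = sum(mixtures_per_part.values()) # 4 + 3 + 5 + 2 = 14

print(num_templates, num_local_models)
```

With just 14 trained local part models, 120 distinct subcategory templates are representable; adding one more part type multiplies, rather than adds to, that total.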
We extensively evaluate our models on standard benchmarks and show superior performance over the state of the art. Finally, we conclude with a detailed analysis of local part sharing for face analysis, perhaps the most well-studied of all object recognition problems. By using semantically defined parts (such as eyes, nose, and lips), a single model can simultaneously perform face detection, pose estimation, and landmark localization with state-of-the-art accuracy.