An Automated Perceptual Learning Algorithm for Determining Structure-Based Visual Prototypes of Objects from Internet-Scale Data
- Author(s): Chen, Lichao
- Advisor(s): Roychowdhury, Vwani
Object discovery and representation lie at the heart of computer vision and have therefore attracted widespread interest over the past several decades. Early efforts were largely based on single-template and bag-of-visual-words models. To represent the intra-class variety within the same type of object and to address the partial-occlusion problem in images, more complex object representations, such as attribute-based and part-based models, have been proposed. The advent of the Internet, however, enables one to obtain a comprehensive set of images depicting the same object as viewed from different angles and perspectives, together with its natural association with other objects. This opens up new opportunities and challenges: given that, for the first time, we have millions of exemplars of an object embedded in its natural context, can one effectively mimic human-like cognition and build up prototypes (comprising parts, their different views, and their spatial relationships) for each object category? The well-known supervised approach relies heavily on well-labeled image datasets, but (i) image labeling remains prohibitively slow compared with the speed of image crawling, and (ii) it does not yield succinct prototype models for each category that could then be used to locate object instances in a query. In this dissertation, we investigate the open problem of constructing part-based object representation models from very large-scale image databases in an unsupervised manner.
To achieve this goal, we first define a network model within a fully Bayesian setting. This augmented network model encodes spatial information and is scale-invariant across image-resolution variations in the learning set. The model is able to discover visual templates of the same part with dramatically different visual appearances, which, in existing models, must be added manually or extracted from accompanying text on the Internet. We show that the global spatial structure of the underlying and unknown objects can be restored completely from the recorded pairwise relative position data. We also develop an approach to learn the graphical model in a completely unsupervised manner from a large set of unlabeled data, along with a corresponding algorithm that performs detection using the learned model. Finally, we apply our algorithm to various crawled and archived datasets, and show that our approach is computationally scalable and constructs part-based models much more efficiently than those presented in the recent computer vision literature.
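The claim that global spatial structure can be restored from pairwise relative position data can be illustrated with a minimal sketch. The snippet below is a hypothetical toy example, not the dissertation's actual algorithm: given noiseless offsets d_ij ≈ x_j − x_i between pairs of parts, it recovers all part positions by linear least squares, anchoring the first part at the origin to remove the translational ambiguity. All function and variable names here are illustrative assumptions.

```python
import numpy as np

def recover_positions(n_parts, offsets):
    """Recover 2-D part positions from pairwise offsets.

    offsets: list of (i, j, d) where d ≈ x_j - x_i (a length-2 array).
    Returns an (n_parts, 2) array with part 0 anchored at the origin.
    """
    # One equation per offset per coordinate, plus two anchor equations.
    n_eq = 2 * len(offsets) + 2
    A = np.zeros((n_eq, 2 * n_parts))
    b = np.zeros(n_eq)
    for r, (i, j, d) in enumerate(offsets):
        for c in range(2):  # x and y coordinates
            A[2 * r + c, 2 * j + c] = 1.0   # +x_j
            A[2 * r + c, 2 * i + c] = -1.0  # -x_i
            b[2 * r + c] = d[c]
    # Anchor part 0 at (0, 0): positions are only defined up to translation.
    A[-2, 0] = 1.0
    A[-1, 1] = 1.0
    pos, *_ = np.linalg.lstsq(A, b, rcond=None)
    return pos.reshape(n_parts, 2)

# Three parts at (0,0), (1,0), (1,2), described only by relative offsets.
measured = [
    (0, 1, np.array([1.0, 0.0])),
    (1, 2, np.array([0.0, 2.0])),
    (0, 2, np.array([1.0, 2.0])),
]
layout = recover_positions(3, measured)
```

With noisy or redundant offsets the same least-squares system simply averages the inconsistencies, which is why pairwise relative positions alone can pin down the global layout up to a translation.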