Modeling objects is one of the core problems in computer vision. A good object model can be applied to multiple visual recognition tasks. In this thesis, we use compositional relations to model articulated objects and their parts, which is challenging due to the large variability across subtypes, viewpoints, and poses. Intuitively, objects are composed of parts, and each part is in turn composed of subparts. By applying these compositional relations recursively, we obtain a hierarchical structure of the object, which we call a compositional model. Compared to popular black-box deep learning systems, our model offers explainability and adaptivity to more complex scenarios. Compositional models face three challenges: 1) how to learn the parts and subparts; 2) how to learn the compositional structure between parts and subparts; 3) how to make learning and inference efficient. In this thesis, we focus on three projects built around the concepts of composition and parts.
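To make the recursive structure concrete, the following is a minimal sketch of an object represented as a compositional hierarchy; the class name and the particular horse parts are illustrative assumptions, not the thesis implementation.

```python
# Minimal sketch of a hierarchical compositional model as a recursive tree.
# All names here are illustrative, not the thesis implementation.
from dataclasses import dataclass, field
from typing import List


@dataclass
class Part:
    """A node in the compositional hierarchy: an object, part, or subpart."""
    name: str
    children: List["Part"] = field(default_factory=list)

    def depth(self) -> int:
        """Number of compositional levels at and below this node."""
        return 1 + max((c.depth() for c in self.children), default=0)


# An object is composed of parts; each part is composed of subparts.
horse = Part("horse", [
    Part("head", [Part("ear"), Part("muzzle")]),
    Part("torso", [Part("back"), Part("belly")]),
    Part("leg", [Part("thigh"), Part("hoof")]),
])
print(horse.depth())  # 3 levels: object -> part -> subpart
```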
In the first project, we study the problem of semantic part segmentation for animals. A compositional model is used to represent the boundaries of objects and parts. Given part-level boundary annotations, a mixture of compositional models is learned to handle variation in viewpoint and pose. We also incorporate edge, appearance, and semantic part cues into the compositional model. In addition, we develop a linear-complexity algorithm for efficient inference. Evaluation on horse and cow images from PASCAL VOC demonstrates the effectiveness of our method.
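Efficient inference in tree-structured models of this kind typically rests on dynamic programming: each part's best subtree score is computed once per candidate placement and reused, so cost grows linearly with the number of parts. The sketch below illustrates this standard pattern under assumed placeholder score functions and a toy part tree; it is not the thesis algorithm itself.

```python
# Sketch of dynamic-programming inference on a tree-structured compositional
# model. With memoization, each (part, placement) pair is scored once, so the
# total cost is linear in the number of parts (times candidate placements
# squared). Scores, parts, and candidates below are illustrative placeholders.
from functools import lru_cache
from typing import Tuple

Placement = Tuple[int, int]  # candidate (row, col) location of a part

CHILDREN = {"horse": ["head", "leg"], "head": ["ear"]}
CANDIDATES = [(0, 0), (1, 2), (3, 1)]


def unary_score(part: str, p: Placement) -> float:
    """Placeholder appearance/edge cue for placing `part` at `p`."""
    return -0.1 * (p[0] + p[1])


def pairwise_score(p: Placement, q: Placement) -> float:
    """Placeholder spatial-relation score between parent and child."""
    return -0.01 * ((p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2)


@lru_cache(maxsize=None)
def best_score(part: str, p: Placement) -> float:
    """Best total score of the subtree rooted at `part`, with `part` at `p`."""
    total = unary_score(part, p)
    for child in CHILDREN.get(part, []):
        total += max(best_score(child, q) + pairwise_score(p, q)
                     for q in CANDIDATES)
    return total


root_best = max(best_score("horse", p) for p in CANDIDATES)
print(root_best)
```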
In the second project, we propose a novel unsupervised method for learning object semantic parts from the internal states of CNNs. Our hypothesis is that semantic parts are represented by populations of neurons rather than by single filters. We propose a simple clustering technique to extract part representations, which we call visual concepts. We show qualitatively that visual concepts are semantically coherent, in that they represent most semantic parts of the object, and visually coherent, in that their corresponding image patches appear very similar. We then treat the visual concepts as part detectors and quantitatively evaluate their performance at detecting semantic parts. Experiments on the PASCAL3D+ dataset and our newly annotated VehicleSemanticPart dataset support the hypothesis that CNNs have internal representations of object parts encoded by populations of neurons.
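As a rough illustration of the pipeline, the sketch below clusters intermediate-layer feature vectors (synthetic here) into cluster centers that play the role of visual concepts. K-means, the L2 normalization, the feature dimensionality, and the cluster count are assumptions made for this example, not necessarily the exact settings used in the thesis.

```python
# Sketch of extracting "visual concepts" by clustering CNN feature vectors.
# Assumes features were already extracted from an intermediate layer (one
# vector per spatial position per image); K-means is used here as one simple
# clustering choice, and all sizes below are illustrative.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
features = rng.normal(size=(10000, 512))                     # stand-in for CNN activations
features /= np.linalg.norm(features, axis=1, keepdims=True)  # L2-normalize

kmeans = KMeans(n_clusters=200, n_init=10, random_state=0).fit(features)
concepts = kmeans.cluster_centers_  # one center per visual concept

# At test time, a feature vector is assigned to its nearest concept,
# turning each concept into a simple part detector.
query = features[:5]
print(kmeans.predict(query))
```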
In the third project, we use visual concepts to detect semantic parts on partially occluded objects. We consider a scenario where the model is trained on non-occluded images but tested on occluded ones. The motivation is that the number of possible occlusion patterns in the real world is effectively unbounded, so models should be inherently robust and adaptive to occlusion rather than learning the occlusion patterns present in the training data. Our approach uses a simple voting scheme, based on log-likelihood ratio tests and spatial constraints, to combine the evidence of local cues derived from visual concepts. Experiments show that our algorithm outperforms several competitors in semantic part detection, especially in the presence of occlusion.
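A minimal sketch of the voting idea follows: each local cue casts a vote for the part center at a spatial offset, weighted by the log-likelihood ratio of the cue under the part versus the background, and the part is detected where the accumulated evidence peaks. The probabilities, offsets, and map size here are invented for illustration and do not come from the thesis.

```python
# Sketch of the voting scheme: local cues (visual-concept activations) cast
# spatially offset votes for a part's center, weighted by a log-likelihood
# ratio. Distributions and offsets below are illustrative placeholders.
import numpy as np

H, W = 64, 64
vote_map = np.zeros((H, W))

# Hypothetical local cues: (position, P(cue | part), P(cue | background),
# expected offset from cue to part center).
cues = [
    ((20, 20), 0.30, 0.05, (4, 0)),
    ((24, 18), 0.25, 0.04, (0, 2)),
    ((50, 50), 0.02, 0.05, (0, 0)),  # background-like cue casts a negative vote
]

for (r, c), p_part, p_bg, (dr, dc) in cues:
    log_ratio = np.log(p_part / p_bg)  # log-likelihood ratio test statistic
    tr, tc = r + dr, c + dc            # spatial constraint: vote at the offset
    if 0 <= tr < H and 0 <= tc < W:
        vote_map[tr, tc] += log_ratio

# Detect the part where the accumulated evidence is strongest.
center = np.unravel_index(np.argmax(vote_map), vote_map.shape)
print(center, vote_map[center])
```

Because votes are additive log-likelihood ratios, occluding some cues removes their contributions without corrupting the rest, which is what makes this kind of evidence combination naturally robust to partial occlusion.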