Making a computer system understand complex image scenes is challenging. Complex image scenes often contain multiple objects, which are not isolated but related to each other in various aspects. Identifying individual object categories may not be enough to understand such scenes: categories exist at multiple granularities, and we need this knowledge to capture semantic correlations thoroughly. In addition, objects engage in numerous interactions and relationships. We need to localize these objects, recognize scene environments, and figure out their interactions and relationships. In computer vision, recognizing what the categories are, where the objects are, and how objects interact with each other is often formulated as the classification, segmentation, and relationship recognition problems.
Existing approaches often tackle all these formulations in supervised settings. Despite their tremendous progress, we identify three major limitations. 1) Human annotation is too time- and labor-consuming to scale up to real-world scenarios. 2) The sets of human labels are pre-selected arbitrarily, providing limited and biased perspectives for understanding images. 3) Such supervised methods conduct inference in terms of discrete labeling: they isolate labels from each other, ignoring the similarities and dissimilarities among labels. They can also only assign images to the known labels seen during training, and fail to recognize novel images sampled from unknown labels at test time.
In this dissertation, we address the issues of current supervised approaches by replacing discrete labeling with grouping and by using minimal human labels. Specifically, we tackle the recognition problem from four perspectives. 1) We address weakly-supervised semantic segmentation, where partial semantic pixel labels are used. 2) We address unsupervised semantic segmentation, where only low-level edge detections are used. 3) We address unsupervised concurrent image classification and segmentation in a single framework, where
our model does not use any human labels. 4) We address unsupervised human-object interaction recognition, where semantic and instance pixel labels, but no relationship labels, are used. This dissertation explores more general and robust approaches to understanding highly complex and fast-changing real-world scenes.