Learning Visual Groupings and Representations with Minimal Human Labels
eScholarship
Open Access Publications from the University of California


UC Berkeley Electronic Theses and Dissertations


Abstract

Making a computer system understand complex image scenes is challenging. Complex image scenes often contain multiple objects, which are not isolated but related to each other in different aspects. Identifying certain object categories may not be enough to understand complex scenes. Categories have multiple granularities, and we need such knowledge to capture semantic correlation thoroughly. In addition, objects have numerous interactions and relationships: we need to localize the objects, recognize scene environments, and figure out how the objects interact. In computer vision, recognizing what the categories are, where the objects are, and how objects interact with each other is often formulated as the classification, segmentation, and relationship recognition problem, respectively.

Existing approaches often tackle all these formulations in supervised settings. Despite their tremendous progress, we identify three major limitations. 1) Human annotation is too time- and labor-consuming to scale up to real-world scenarios. 2) The sets of human labels are pre-selected arbitrarily, providing limited or biased perspectives for understanding images. 3) Such supervised methods conduct inference in terms of discrete labeling. They isolate labels from each other, ignoring the similarity or dissimilarity among labels. They can also only assign images to the known labels seen during training, and fail to recognize novel images sampled from unknown labels during testing.

In this dissertation, we address the issues of current supervised approaches by replacing discrete labeling with grouping and by using minimal human labels. Specifically, we tackle the recognition problem from four perspectives. 1) We address weakly-supervised semantic segmentation, where partial semantic pixel labels are used. 2) We address unsupervised semantic segmentation, where only low-level edge detections are used. 3) We address unsupervised concurrent image classification and segmentation in a single framework, where our model does not use any human labels. 4) We address unsupervised human-object recognition, where semantic and instance pixel labels, but no relationship labels, are used. This dissertation explores more general and robust approaches to understanding the highly complex and fast-changing real-world scene.
