The solution to a supervised computer vision problem consists of an application, an algorithm, input data, and a set of human-generated labels. Solving such tasks involves collecting large quantities of data, gathering appropriate labels, and developing computer vision algorithms tailored to the application. Progress on these problems has often benefited from large-scale datasets with high-fidelity labels. Successful solutions exhibit a synergy between the application's goals and the size and quality of the dataset. This thesis presents work highlighting the importance of each component of a supervised vision task.
First, the problem of automatically classifying groups of people into social categories, called Urban Tribe Classification, is introduced. To tackle this problem, both each individual and the group as a whole are modeled. Since this was a newly introduced computer vision problem, a dataset for the task was created. On this dataset, the combined representation of the group and its individuals outperforms a representation built from the individuals alone. This model showed promising results for automatic subculture classification.
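The idea of combining per-person and whole-group information can be sketched as follows. The pooling scheme below (mean- and max-pooling over people, concatenated with a group-level scene descriptor) is a hypothetical stand-in for illustration only, not the thesis's actual model:

```python
def group_representation(person_feats, scene_feat):
    """Combine per-person descriptors with a whole-group descriptor.

    Illustrative assumption: each person is an equal-length feature
    vector; people are mean- and max-pooled so the output length is
    fixed regardless of group size, then a scene feature is appended.
    """
    d = len(person_feats[0])       # per-person descriptor length
    n = len(person_feats)          # number of people in the group
    mean_pool = [sum(p[j] for p in person_feats) / n for j in range(d)]
    max_pool = [max(p[j] for p in person_feats) for j in range(d)]
    return mean_pool + max_pool + list(scene_feat)

# Two people with 3-D descriptors plus a 2-D scene descriptor:
vec = group_representation([[1, 2, 3], [3, 2, 1]], [5, 5])
# vec has length 3 + 3 + 2 = 8 for any number of people
```

The fixed-length output is what lets a single classifier consume groups of varying size.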
Second, the problem of creating perceptual embeddings based on human similarity judgements is tackled. This work focuses on triplet similarity comparisons of the form ``Is object $i$ more similar to $j$ or to $k$?'', which have proven useful for computer vision and machine learning applications. Unfortunately, triplet similarity comparisons, like many human labeling efforts, can be prohibitively expensive to collect. This work proposes two techniques for dealing with this obstacle. First, an alternative display for collecting triplets is designed. This display shows a probe image alongside a grid of query images, allowing multiple triplets to be collected from a single response. The display is shown to reduce both the cost and the time of triplet collection, and the resulting triplets produce higher-quality embeddings. Using this UI, a dataset of human taste similarity over 10,000 food items was created. Second, ``SNaCK,'' a low-dimensional perceptual embedding algorithm that combines human expertise with automatic machine kernels, is introduced. The two parts are complementary: human insight can capture relationships that are not apparent from visual similarity alone, and the machine relieves the human of having to exhaustively specify many constraints.
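The efficiency of the grid display can be sketched as follows: when a rater marks which grid images are most similar to the probe, every (selected, unselected) pair implies one triplet, so one screen yields many comparisons. The item names below are made up for illustration:

```python
from itertools import product

def triplets_from_grid(probe, selected, unselected):
    """Expand one grid response into many triplet comparisons.

    Each (selected, unselected) pair implies the triplet
    (probe, closer, farther): the probe is more similar to the
    selected image than to the unselected one. Marking k of n grid
    images yields k * (n - k) triplets from a single screen.
    """
    return [(probe, s, u) for s, u in product(selected, unselected)]

# One screen: probe "ramen", 2 images marked similar, 3 left unmarked
ts = triplets_from_grid("ramen", ["pho", "udon"],
                        ["cake", "salad", "burger"])
# 2 * 3 = 6 triplets, e.g. ("ramen", "pho", "cake")
```

This multiplicative yield is why the grid display lowers the per-triplet cost relative to asking one triplet question at a time.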
Finally, the precise localization of key frames of an action is explored. This work focuses on detecting the exact starting frame of a behavior, an important task for neuroscience research. To address this problem, a loss is designed that penalizes extra and missed action-start detections more heavily than small temporal misalignments. Recurrent neural networks (RNNs) are trained to optimize this loss. The model is shown to reduce the number of false positives, an important criterion for the neuroscience application. Performance is evaluated on a new dataset created for neuroscience research, the Mouse Reach Dataset: a large, annotated video dataset of mice performing a sequence of actions. On this dataset, the proposed model outperforms related approaches and baseline methods trained with an unstructured loss.
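The asymmetry this loss encodes can be illustrated with a scoring function over detected start frames. The greedy matching, tolerance window, and cost constants below are illustrative assumptions, not the thesis's exact formulation:

```python
def action_start_cost(pred_frames, gt_frames, tol=10,
                      fp_cost=1.0, miss_cost=1.0, align_cost=0.01):
    """Score predicted action-start frames against ground truth.

    Sketch only: predictions are greedily matched one-to-one to the
    nearest unmatched ground-truth start within `tol` frames. Extra
    detections (false positives) and missed starts pay a full unit
    cost, while a matched start pays only a small per-frame offset
    cost, mirroring the structured penalty described above.
    """
    gt = sorted(gt_frames)
    used = [False] * len(gt)
    cost = 0.0
    for p in sorted(pred_frames):
        # Nearest unmatched ground-truth start within the window.
        best, best_d = None, tol + 1
        for idx, g in enumerate(gt):
            d = abs(p - g)
            if not used[idx] and d <= tol and d < best_d:
                best, best_d = idx, d
        if best is None:
            cost += fp_cost               # spurious detection
        else:
            used[best] = True
            cost += align_cost * best_d   # small misalignment penalty
    cost += miss_cost * used.count(False)  # missed starts
    return cost

# A 5-frame misalignment is cheap; an extra plus a missed start is not:
near_miss = action_start_cost([100, 205], [100, 200])   # small cost
bad = action_start_cost([100, 300], [100, 200])         # 1 FP + 1 miss
```

Under such a scoring function, a detector that hedges with extra detections is punished far more than one that is a few frames late, which matches the stated neuroscience requirement of few false positives.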