Robust Detection with Local Steering Kernel: Maximum Margin Matrix Cosine Similarity and Beyond
- Author(s): Biswas, Sujoy Kumar
- Advisor(s): Milanfar, Peyman
- et al.
An important lesson from two decades of research in object detection is the success of mid-level attributes such as filters or templates. Histogram of Oriented Gradients, or HOG, had long been the standard tool for representing such templates. With the introduction of convolutional neural networks, the focus of feature computation has recently shifted toward more powerful representation learning techniques. Although these techniques perform better, the benefit comes at the price of a long training phase, complicated hardware requirements, and, of course, a large set of data with clean annotations.
In this thesis we propose a fundamentally different representation for image templates in the form of multidimensional tensors that look beyond the histogram features of HOG by aggressively capturing local image geometry. As a consequence, the proposed tensor representation of templates is robust to noise and signal perturbation, and yields excellent localization performance in unconventional and difficult scenarios (e.g., low-resolution or noisy images and video) where traditional HOG features fail to perform well. Moreover, owing to signal processing techniques, tensors are amenable to a rich set of tools that make object detection fast, efficient, and scalable. Using an exact acceleration of matrix cosine similarity (our decision rule for detection), we make the search for a query image in a larger target image much faster. Building on these premises, we propose a maximum margin formulation with a relatively simple and fast training phase to detect pedestrians in challenging videos composed of infrared (thermal) images. The proposed kernel method is robust enough to handle noisy annotations, i.e., missed annotations in the ground truth, as our experimental findings show. The thesis further contributes a dimensionality reduction technique that not only reduces the number of feature channels during detection but also preserves local image geometry in the derived subspaces, resulting in better discrimination between objects and background.
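For readers unfamiliar with the decision rule named above, the following is a minimal sketch of matrix cosine similarity, taken here in its usual form as the Frobenius inner product of two equally sized matrices divided by the product of their Frobenius norms. The function names (`frobenius_inner`, `matrix_cosine_similarity`, `best_match`) and the toy matrices are hypothetical and serve only as illustration. The matrices stand in for feature matrices; the thesis computes its features from local steering kernels, which are not reproduced here. The search is a naive sliding window, not the exact acceleration the thesis describes.

```python
import math


def frobenius_inner(a, b):
    """Frobenius inner product of two equally sized matrices (lists of rows)."""
    return sum(x * y for row_a, row_b in zip(a, b) for x, y in zip(row_a, row_b))


def matrix_cosine_similarity(a, b):
    """Cosine of the angle between two matrices viewed as flattened vectors."""
    norm_a = math.sqrt(frobenius_inner(a, a))
    norm_b = math.sqrt(frobenius_inner(b, b))
    if norm_a == 0.0 or norm_b == 0.0:
        # Convention chosen for this sketch: an all-zero matrix matches nothing.
        return 0.0
    return frobenius_inner(a, b) / (norm_a * norm_b)


def best_match(query, target):
    """Naive sliding-window search (assumption: brute force, no acceleration).

    Returns (row, col, score) for the window of `target` most similar to `query`.
    """
    q_rows, q_cols = len(query), len(query[0])
    t_rows, t_cols = len(target), len(target[0])
    best = (0, 0, float("-inf"))
    for r in range(t_rows - q_rows + 1):
        for c in range(t_cols - q_cols + 1):
            window = [row[c:c + q_cols] for row in target[r:r + q_rows]]
            score = matrix_cosine_similarity(query, window)
            if score > best[2]:
                best = (r, c, score)
    return best


# Toy data: the query appears verbatim inside the target at offset (1, 1).
query = [[1.0, 2.0], [3.0, 4.0]]
target = [
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 2.0, 0.0],
    [0.0, 3.0, 4.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
]
row, col, score = best_match(query, target)
```

Because the measure is normalized, scaling a matrix by a positive constant leaves the similarity unchanged, so a uniform change in feature magnitude does not affect the score. The brute-force loop recomputes every window from scratch, which is the cost that an accelerated search is meant to avoid.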