Skip to main content
eScholarship
Open Access Publications from the University of California

Semantic transfer with deep neural networks

  • Author(s): Dixit, Mandar
  • Advisor(s): Vasconcelos, Nuno
  • et al.
Abstract

Visual recognition is a problem of significant interest in computer vision. The current solution to this problem involves training a very deep neural network using a dataset with millions of images. Despite the recent success of this approach on classical problems like object recognition, it seems impractical to train a large scale neural network for every new vision task. Collecting and correctly labeling a large amount of images is a big project in itself. The process of training a deep network is also fraught with excessive trial and error and may require many weeks with relatively modest hardware infrastructure. Alternatively one could leverage the information already stored in a trained network for several other visual tasks using transfer learning.

In this work we consider two novel scenarios of visual learning where knowledge transfer is affected from off-the-shelf convolutional neural networks (CNNs). In the first case we propose a holistic scene representation derived with the help of pre-trained object recognition neural nets. The object CNNs are used to generate a bag of semantics (BoS) description of a scene, which accurately identifies object occurrences~(semantics) in image regions. The BoS of an image is, then, summarized into a fixed length vector with the help of the sophisticated Fisher vector embedding from the classical vision literature. The high selectivity of object CNNs and the natural invariance of their semantic scores facilitate the transfer of knowledge for holitistic scene level reasoning. Embedding the CNN semantics, however, is shown to be a difficult problem. Semantics are probability multinomials that reside in a highly non-Euclidean simplex. The difficulty of modeling in this space is shown to be a bottle-neck to implementing a discriminative Fisher vector embedding. This problem is overcome by reversing the probability mapping of CNNs with a natural parameter transformation. In the natural parameter space, the object CNN semantics are efficiently combined with a Fisher vector embedding and used for scene level inference. The resulting semantic Fisher vector achieves state-of-the-art scene classification indicating the benefits of BoS based object-to-scene transfer.

To improve the efficacy of object-to-scene transfer, we propose an extension of the Fisher vector embedding. Traditionally, this is implemented as a natural gradient of Gaussian mixture models (GMMs) with diagonal covariance. A significant amount of information is lost due to the inability of these models to capture covariance information. A mixture of Factor analyzers (MFAs) are used instead to allow efficient modeling of a potentially non-linear data distribution in the semantic manifold. The Fisher vectors derived using MFAs are shown to improve substantially over the GMM based embedding of object CNN semantics. The improved transfer-based semantic Fisher vectors are shown to outperform even the CNNs trained on large scale scene datasets.

Next we consider a special case of transfer learning, known as few-shot learning, where the training images available for the new task are very few in number (typically less than 10). Extreme scarcity of data points prevents learning a generalize-able model even in the rich feature space of pre-trained CNNs. We present a novel approach of attribute guided data augmentation to solve this problem. Using an auxiliary dataset of object images labeled with 3D depth and pose, we learn trajectories of variations along these attributes. To the training examples in a few-shot dataset, we transfer these learned attribute trajectories and generate synthetic data points. Along with the original few-shot examples, the additional synthesized data can also be used for the target task. The proposed guided data augmentation strategy is shown to improve both few-shot object recognition and scene recognition performance.

Main Content
Current View