Scholarly Works (64 results)

Article

Aeolian Sound

Yang, Ming

Contemporary Music Score Collection (2020)

This music score was submitted for the Kaleidoscope 2020 Call for Scores, an open access collaboration with the UCLA Music Library.

Article
Peer Reviewed

Effect of Anisotropic Consolidation on Cyclic Liquefaction Resistance of Granular Materials via 3D-DEM Modeling

Civil & Environmental Engineering (2024)

Thesis
Peer Reviewed

Data-Driven Object Segmentation in Single Images with Random Field Models

Yang, Jimei
Advisor(s): Yang, Ming-Hsuan

UC Merced Electronic Theses and Dissertations (2015)

As humans, we have a remarkable ability to tell objects apart from cluttered backgrounds and trace their contours even under occlusion. This ability has long motivated computer vision researchers to study the principles and algorithms of object segmentation. Object segmentation is of both theoretical and practical interest, as it is an essential step towards 3D image understanding and intelligent image editing.

To segment an object, we have to recognize it in order to know which parts should be grouped together. In this thesis, we formulate object segmentation as an image labeling problem in random field models to facilitate integrating top-down recognition knowledge with bottom-up image cues. The integration can be driven by either bottom-up segmentation or top-down recognition. The segmentation-driven process requires object-level segmentation hypotheses drawn from bottom-up cues, while the recognition-driven process needs shape and context to be effectively represented. This thesis addresses these issues with a data-driven approach. First, we propose to generate object segmentation proposals from segmentation trees using exemplars. Compared to previous parametric methods, our data-driven method takes advantage of both the diversity and the informativeness of exemplars and thus produces a compact set of highly plausible proposals. Second, we propose novel random field models that jointly learn shape representation and object segmentation. Unlike previous works that use shape representation as a prior, our model emphasizes structured prediction from the recognition model to the shape model. This difference ensures that the shape is well preserved in the resulting segmentation masks, with robustness to partial occlusions. Third, we develop a novel nonparametric method based on multiscale shape transfer, which in turn forms a higher-order random field. Compared to previous works that transfer rigid or deformable masks in image subwindows, our method explores shape masks at multiple granularities and is able to produce high-quality segmentations efficiently. Last but not least, we develop a novel scene parsing system in which small objects are segmented in context. With extensive use of multiscale context and particular care for the long-tailed label distribution, our system demonstrates state-of-the-art results on large-scale problems.

Thesis
Peer Reviewed

Who Will Go Where and When?

Shuai, Zaihong
Advisor(s): Yang, Ming-Hsuan

UC Merced Electronic Theses and Dissertations (2012)

We propose a Bayesian framework for modeling and predicting traffic patterns using information obtained from wireless sensor networks. For concreteness, we apply the proposed framework to a smart building application in which the traffic patterns of humans are modeled and predicted through detection and matching of their images taken from cameras at different locations. Experiments with more than 4,000 images of 20 subjects demonstrate promising results in traffic pattern prediction using the proposed algorithm. The algorithm can also be applied to other domains, including surveillance, traffic monitoring, abnormality detection, and location-based services. In addition, long-term deployment of the network can be used for security, energy conservation, and utilization improvement of smart buildings.

Thesis
Peer Reviewed

Learning Affinity to Parse Images

Liu, Sifei
Advisor(s): Yang, Ming-Hsuan

UC Merced Electronic Theses and Dissertations (2017)

Recent years have witnessed the success of deep learning models such as convolutional neural networks (ConvNets) for numerous vision tasks. However, ConvNets have a significant limitation: they do not have effective internal structures to explicitly learn pairwise relations between image pixels. This yields two fundamental bottlenecks for many vision problems of label and map regression, as well as image reconstruction: (a) the pixels of an image are highly redundant, but this redundancy cannot be efficiently exploited by ConvNets, which predict each pixel independently, and (b) the convolutional operation cannot effectively solve problems that rely on similarities of pixel pairs, e.g., image pixel propagation and shape/mask refinement.

This thesis focuses on how to learn pairwise relations of image pixels with jointly, end-to-end learnable neural networks. Specifically, this is achieved by two different approaches: (a) formulating the conditional random field (CRF) objective as a non-structured objective that can be implemented via ConvNets as an additional loss, and (b) developing spatial-propagation-based, deep-learning-friendly structures that learn the pairwise relations in an explicit manner.

In the first approach, we develop a novel multi-objective learning method that optimizes a single unified deep convolutional network with two distinct non-structured loss functions: one encoding the unary label likelihoods and the other encoding the pairwise label dependencies. We apply this framework to face parsing; experiments on both the LFW and Helen datasets demonstrate that the additional pairwise loss significantly improves the labeling performance compared to a single-loss ConvNet with the same architecture.

In the second approach, we explore how to learn pairwise relations using spatial propagation networks, instead of using additional loss functions. Unlike ConvNets, the propagation module is a spatially recurrent network with a linear transformation between adjacent rows and columns. We propose two typical structures: a one-way connection using one-dimensional propagation, and a three-way connection using two-dimensional propagation. For both models, the linear weights are spatially variant output maps that can be learned from any ConvNet. Since such modules are fully differentiable, they are flexible enough to be inserted into any type of neural network. We prove that while both structures can formulate global affinities, the one-way connection constructs a sparse matrix, and the three-way connection forms a much denser one. While both structures demonstrate their effectiveness over a wide range of vision problems, the three-way connection is more powerful on challenging tasks (e.g., general object segmentation). We show that a well-learned affinity can benefit numerous computer vision applications, including but not limited to image filtering and denoising, pixel/color interpolation, face parsing, and general semantic segmentation. Compared to graphical-model-based pairwise learning, the spatial propagation network can be a good alternative in deep-learning-based frameworks.
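
As a rough illustration of the one-way, one-dimensional propagation described above, the sketch below scans each row from left to right and blends every pixel with its already-propagated left neighbour. The recurrence h[t] = (1 - w[t]) * x[t] + w[t] * h[t-1], the function name, and the hand-picked weights are assumptions made for illustration only; in the thesis the weights are spatially variant maps produced by a ConvNet, and the exact formulation may differ.

```python
def propagate_left_to_right(x, w):
    """One-way linear propagation along each row of a 2-D map.

    x: list of rows of floats (the input map).
    w: same shape as x; per-pixel propagation weights in [0, 1].
       Here they are given by hand (assumption); in a real model they
       would be predicted by a network.
    """
    out = []
    for row_x, row_w in zip(x, w):
        row_h = []
        h_prev = 0.0
        for t, (x_t, w_t) in enumerate(zip(row_x, row_w)):
            if t == 0:
                # The first column has no left neighbour to draw from.
                h_t = x_t
            else:
                # Linear blend of the current input and the previous hidden value.
                h_t = (1.0 - w_t) * x_t + w_t * h_prev
            row_h.append(h_t)
            h_prev = h_t
        out.append(row_h)
    return out


# A single bright pixel spreads to the right with geometrically decaying strength.
x = [[1.0, 0.0, 0.0, 0.0]]
w = [[0.0, 0.5, 0.5, 0.5]]
h = propagate_left_to_right(x, w)
print(h)
```

Because each output depends on every pixel to its left through the chain of weights, a single scan already relates distant pixels, which is the sense in which such a recurrence can express a global affinity.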

Creative Commons 'BY-NC-SA' version 4.0 license

Thesis
Peer Reviewed

Learning to Recognise Objects and Actions for Intelligent Agents

Agarwal, Nakul
Advisor(s): Yang, Ming-Hsuan

UC Merced Electronic Theses and Dissertations (2019)

Computer vision involves a host of tasks, such as boundary detection, semantic segmentation, surface estimation, object detection, image classification, and action localization, to name a few. For a holistic understanding of a scene, which many real-world applications require, several of these tasks need to be combined. For instance, an autonomous car should be able to detect not only other cars (objects) but also whether a pedestrian is walking (an action). The former requires localizing the object, either at the pixel level or at the bounding box level. The latter requires localizing the action, and by extension the actor, in both space and time. These problems are best dealt with by approaches involving supervised learning models that rely on large annotated datasets, so the problem becomes even more challenging when there is a lack of labeled data.

In this thesis, we first tackle the problem of spatio-temporal action localization in an unsupervised setting. As the name suggests, it requires modeling of both spatial and temporal features. So, we propose an end-to-end learning framework for an adaptation method which aligns both spatial and temporal features and conduct experiments on the action localization task. To highlight the potential benefits for autonomous cars, we also construct and benchmark a new dataset which contains pedestrian actions collected in driving scenes. Then, for a holistic understanding of the scene, we shift our attention from localizing actions to recognising objects especially in a city street scenario. We do this by jointly dealing with the tasks of object detection and semantic segmentation. While the former localizes the individual instances of objects at the bounding box level, the latter provides pixel level distinction but at the category level. We explore a novel observation that connects the two tasks and provide an end-to-end learning framework to exploit this connection.

Thesis
Peer Reviewed

Learning Correspondence from Images, Videos and Texts

Xiao, Taihong
Advisor(s): Yang, Ming-Hsuan

UC Merced Electronic Theses and Dissertations (2023)

In computer vision, learning correspondence is a pivotal and fundamental challenge with far-reaching applications. Correspondence encapsulates a measure of similarity between disparate entities, spanning images, videos, and texts. As deep neural networks have demonstrated significant success in computer vision over the past few years, inferring correspondence has been posed as a representation learning task. We learn useful feature representations and infer correspondence with deep neural networks. In this thesis, we undertake the task of learning various types of correspondence and exploring their applications.

First, we consider acquiring dense low-level correspondence between successive video frames, where optical flow represents temporal pixel-level correspondences. Within the landscape of deep optical flow estimation methodologies, the cost volume emerges as a linchpin, encoding the vital pixel-level correlation information. Our contribution comes as a learnable cost volume (LCV) layer, leveraging a positive definite kernel matrix and optimizing its learning through Cayley representations. The proposed LCV is a lightweight module and can be easily plugged into existing models to replace the conventional cost volume. It reduces flow estimation errors and improves the model's robustness against illumination variations, noise, and adversarial input perturbations.
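
To make the idea of a cost volume with a positive definite kernel more concrete, here is a minimal sketch under stated assumptions. It is restricted to two-channel features and one spatial dimension so that it needs no linear algebra library. The factorization W = P^T diag(lam) P, with P obtained from the Cayley transform of a skew-symmetric matrix, is one common way to keep a kernel positive definite during learning; whether it matches the thesis's exact construction is an assumption, and all function names are hypothetical.

```python
def cayley_rotation_2d(s):
    """Cayley transform (I - S)(I + S)^-1 of S = [[0, s], [-s, 0]], in closed form.

    The result is always an orthogonal 2x2 matrix, for any real s.
    """
    d = 1.0 + s * s
    return [[(1.0 - s * s) / d, -2.0 * s / d],
            [2.0 * s / d, (1.0 - s * s) / d]]


def kernel_matrix(s, lam):
    """Build W = P^T diag(lam) P; W is positive definite when all lam > 0."""
    p = cayley_rotation_2d(s)
    return [[sum(p[k][i] * lam[k] * p[k][j] for k in range(2))
             for j in range(2)] for i in range(2)]


def correlation(f1, f2, w):
    """Generalized inner product f1^T W f2 between two feature vectors."""
    return sum(f1[i] * w[i][j] * f2[j] for i in range(2) for j in range(2))


def cost_volume_1d(feat1, feat2, w, radius):
    """Cost for each position p and displacement d in [-radius, radius].

    Out-of-range displacements are assigned a cost of 0.0.
    """
    n = len(feat1)
    volume = []
    for p in range(n):
        row = []
        for d in range(-radius, radius + 1):
            q = p + d
            row.append(correlation(feat1[p], feat2[q], w) if 0 <= q < n else 0.0)
        volume.append(row)
    return volume


# With s = 0 and lam = [1, 1] the kernel is the identity, which recovers
# the conventional dot-product cost volume.
identity_w = kernel_matrix(0.0, [1.0, 1.0])
feat1 = [[1.0, 0.0], [0.0, 1.0]]
feat2 = [[0.0, 1.0], [1.0, 0.0]]
volume = cost_volume_1d(feat1, feat2, identity_w, radius=1)
print(volume)
```

In a trainable version, `s` and the logarithms of `lam` would be free parameters, so any gradient step still yields a valid positive definite kernel.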

Second, we delve into semantic correspondence across distinct images, a task more challenging than optical flow estimation. Here, we confront the complexities stemming from vast variations in appearance, scale, and pose, even among objects in the same category. We introduce an affinity matrix to represent semantic similarity between images. Our novel approach harnesses multi-level contrastive learning for semantic matching. It leverages image-level contrastive learning to guide convolutional features in locating correspondence between similar objects. Further, we enhance performance through pixel-level cross-instance cycle consistency. This methodology outperforms prevailing approaches in this domain.

Finally, we explore the correspondence between images and text, crucial in vision-language foundation models bridging disparate modalities. These models employ visual and textual encoders, mapping both modalities into a shared embedding space. While pretrained representations from extensive data yield impressive zero-shot performance in tasks like image classification, their potential wanes when dealing with few examples per category. To address this challenge, we propose a category name initialization method that initializes the visual classification head with text embeddings of category names. Extensive experimental results show that the category name initialization method propels our model to achieve state-of-the-art results in various few-shot image classification benchmarks.
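
The category name initialization idea can be sketched in a few lines. The toy three-dimensional "text embeddings" below stand in for the output of a real text encoder, and the function names are hypothetical; the sketch only shows the mechanism of using normalized text embeddings as the initial weights of the visual classification head, before any few-shot fine-tuning.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(a * a for a in v))
    return [a / norm for a in v]


def init_head_from_names(names, text_embed):
    """Return one weight row per category, taken from its name's text embedding.

    text_embed: dict mapping a category name to its embedding (assumed to
    come from a pretrained text encoder; toy values are used below).
    """
    return [l2_normalize(text_embed[name]) for name in names]


def classify(head, image_embedding, names):
    """Score an image embedding against every head row by cosine similarity."""
    z = l2_normalize(image_embedding)
    logits = [sum(w * a for w, a in zip(row, z)) for row in head]
    return names[logits.index(max(logits))], logits


names = ["cat", "dog"]
toy_text_embed = {"cat": [2.0, 0.0, 0.0], "dog": [0.0, 3.0, 0.0]}
head = init_head_from_names(names, toy_text_embed)

# An image embedding lying close to the "cat" direction in the shared space.
label, logits = classify(head, [0.9, 0.1, 0.0], names)
print(label)
```

Starting from these weights, the head already produces sensible predictions with zero training examples, and the few available labeled images only need to refine it.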

Creative Commons 'BY-NC' version 4.0 license

Thesis
Peer Reviewed

Learning shape priors with neural networks

Safar, Simon
Advisor(s): Yang, Ming-Hsuan

UC Merced Electronic Theses and Dissertations (2014)

We propose two methods for object segmentation by combining learned shape priors with local features. The first, Max-Margin Boltzmann Machines, learns shapes in an unsupervised way, followed by a joint refinement using features extracted from the image, using max-margin methods. Second, we investigate the feasibility of another approach, based on deep learning and patchwise output mask refinement. As a way to further improve results, we also present an application of structured learning to learn Graph Cut based segmentation mask smoothing.

We conduct experiments on datasets containing diverse images of three classes of objects, showing promising results. We also discuss both qualitative and quantitative results extensively and point out both the strengths and shortcomings of the above approaches.

Thesis
Peer Reviewed

Data-Driven Visual Synthesis for Natural Image and Video Editing

Li, Yijun
Advisor(s): Yang, Ming-Hsuan

UC Merced Electronic Theses and Dissertations (2019)

Visual data are what make our daily life fun. Oftentimes, we consume data created by experts in related fields, e.g., appreciating artworks drawn by famous painters or watching movies shot by professional directors. What about creating data that express our own feelings, ideas, and creativity? This is the aim of visual synthesis, the process of synthesizing new data or altering existing data. However, attempts by non-experts often end up deviating from the manifold of real natural data, leading to unrealistic results with undesired artifacts. The goal of the research in this thesis is to develop effective computational models that preserve visual realism and facilitate more striking creations. We mainly develop data-driven approaches that learn from large amounts of existing visual data, and we explore effective models that generalize to a wide range of unseen target data. Essentially, visual synthesis manipulates the different factors that form the final observed data, such as structure, style, content, and motion. Along this direction, we explore four synthesis tasks for various image and video editing scenarios: structure enhancement, style transfer, content filling, and motion prediction.

Chapter 3 describes a joint filtering method for enhancing the sharpness of low-quality structures in images. The basic idea is to leverage a reference image as a prior and transfer its structural information to the target image. Chapter 4 presents how to replace the style of an image with a new style. We propose a universal style transfer algorithm that works for arbitrary style inputs. Chapter 5 focuses on how to fill in missing content in images in order to remove occlusions. We target face completion, which is more challenging as it often requires generating semantically new pixels for the missing key components. In Chapter 6, we present a novel algorithm for generating pixel-level future frames over multiple time steps given one still image. This represents an important step towards simulating similar preplay activities that might constitute an automatic prediction mechanism in the human visual cortex.

Creative Commons 'BY' version 4.0 license

Thesis
Peer Reviewed

Multi-frame Video Prediction with Learnable Temporal Motion Encodings

Jasti, Rakesh
Advisor(s): Yang, Ming-Hsuan

UC Merced Electronic Theses and Dissertations (2020)

While recent deep learning methods have made significant progress on the video prediction problem, most methods predict the immediate or a fixed number of future frames. To obtain longer-term frame predictions, existing techniques usually process the predicted frames iteratively, resulting in blurry or inconsistent predictions. In this thesis, we present a new approach that can predict an arbitrary number of future video frames with a single forward pass through the network. Instead of directly predicting a fixed number of future optical flows or frames, we learn temporal motion encodings, i.e., temporal motion basis vectors and a network to predict the coefficients. The learned motion basis can be easily extended to arbitrary length at inference time, enabling us to predict an arbitrary number of future frames. Experiments on benchmark datasets indicate that our approach performs favorably against state-of-the-art techniques even for the next frame prediction setting. When evaluated under 5-frame or 10-frame prediction settings, the proposed method obtains bigger performance gains over the existing state-of-the-art techniques that iteratively process the predictions.
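
The following sketch illustrates why a basis-plus-coefficients representation allows prediction over an arbitrary horizon. The motion of a pixel at future step t is written as a weighted sum of temporal basis functions evaluated at t, so once the coefficients are predicted, any number of steps can be read off without feeding predictions back into the network. The thesis learns its basis; the polynomial basis used here, the function names, and the hand-set coefficients are stand-in assumptions for illustration.

```python
def motion_basis(t, num_basis):
    """Hypothetical stand-in basis: b_k(t) = t ** (k + 1) for k = 0..num_basis-1."""
    return [float(t) ** (k + 1) for k in range(num_basis)]


def predict_displacements(coeffs, num_frames):
    """Displacement (dx, dy) of one pixel at future steps 1..num_frames.

    coeffs: one (cx, cy) pair per basis function. In a real model these
    would be predicted by a network in a single forward pass (assumption).
    """
    flows = []
    for t in range(1, num_frames + 1):
        basis = motion_basis(t, len(coeffs))
        dx = sum(b * cx for b, (cx, _) in zip(basis, coeffs))
        dy = sum(b * cy for b, (_, cy) in zip(basis, coeffs))
        flows.append((dx, dy))
    return flows


# Constant velocity in x, accelerating motion in y.
coeffs = [(1.0, 0.0), (0.0, 0.5)]
flows5 = predict_displacements(coeffs, 5)
flows10 = predict_displacements(coeffs, 10)
print(flows5[:3])
```

The same coefficients serve both the 5-frame and the 10-frame horizon; only the range over which the basis is evaluated changes.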

1 supplemental ZIP