Towards Effective Visual Learning for Data-Centric Machine Vision
- Li, Pu
- Advisor(s): Liu, Xiaobai
Abstract
Over the past decade, Artificial Neural Networks (ANNs) have emerged as a prevalent approach for solving a wide range of artificial intelligence tasks, including image classification, protein structure prediction, and natural language processing. Loosely inspired by the structure and function of biological neural networks, these models consist of interconnected nodes with learnable parameters and achieve competitive performance when trained on large-scale datasets. However, manually annotating such large-scale datasets is expensive and time-consuming, which limits the performance and real-world applicability of ANNs.
This dissertation proposes a series of models and algorithms for automatically establishing new training datasets or expanding existing ones without requiring additional human annotation effort. These approaches significantly improve the efficiency of visual learning while reducing the cost of annotation. In contrast to model-centric approaches that focus on improving model architectures and training strategies, the proposed methods are data-centric, with the primary goal of generating or selecting data that can benefit model training. These approaches are applied to tasks that involve recognizing specific targets within visual representations, such as images and time-frequency spectrograms of audio data.
The proposed data-centric approaches introduce three novel learning schemes to the existing literature. The first scheme involves automatic generation of training data for visual models. To achieve this, multiple data generation methods are developed, including heuristics and learning-based generative models. The former superimposes target signals onto background scenes using manually designed rules, while the latter uses a stage-wise generative adversarial network to generate realistic data by sequentially producing backgrounds and foreground targets from random signals.
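As a rough illustration of the heuristic route, the sketch below superimposes a synthetic target contour onto a background spectrogram patch using a simple additive rule. The function name, blending rule, and parameters are illustrative assumptions, not taken from the dissertation.

```python
import numpy as np

def superimpose_target(background: np.ndarray,
                       target: np.ndarray,
                       top: int, left: int,
                       gain: float = 1.0) -> np.ndarray:
    """Paste a synthetic target signal onto a background scene.

    `background` and `target` are 2-D arrays (e.g. time-frequency
    spectrogram patches); `gain` scales the target's intensity so the
    synthetic sample roughly matches realistic signal-to-noise levels.
    """
    sample = background.copy()
    h, w = target.shape
    region = sample[top:top + h, left:left + w]
    # Additive blending is one simple manually designed rule; a real
    # pipeline might also warp, fade, or clip the target before pasting.
    sample[top:top + h, left:left + w] = np.clip(region + gain * target, 0.0, 1.0)
    return sample

# Example: place a faint synthetic contour onto a noise-only background.
rng = np.random.default_rng(0)
bg = rng.random((128, 256)) * 0.3
contour = np.zeros((16, 64))
contour[8, :] = 1.0                       # a simple horizontal contour
augmented = superimpose_target(bg, contour, top=40, left=100, gain=0.7)
```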
The second scheme aims to train models from an augmented dataset. Although many transformations can be used to augment training data, finding an optimal augmentation strategy in a large hyperparameter space is challenging. To address this, the proposed policy-driven framework uses a reinforcement learning model to predict the optimal sequence of transformations to apply to each data sample.
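The following sketch shows the policy-driven idea in minimal form: a categorical policy samples a short sequence of transformations to apply to one sample. The transformation set, the fixed logits, and all names are assumptions; in the actual framework the policy would be a trained reinforcement learning model rewarded by downstream performance.

```python
import numpy as np

# Candidate transformations; placeholders for whatever the real search
# space contains.
TRANSFORMS = {
    "identity":  lambda x: x,
    "flip":      lambda x: np.fliplr(x),
    "add_noise": lambda x: x + np.random.default_rng(0).normal(0, 0.05, x.shape),
    "scale":     lambda x: np.clip(x * 1.2, 0.0, 1.0),
}

def sample_policy(logits: np.ndarray, length: int, rng) -> list:
    """Sample a sequence of transformation names from policy logits.

    In a reinforcement-learning setup, `logits` would come from a policy
    network conditioned on the sample and be updated with a reward such
    as validation accuracy; here they are a fixed vector for illustration.
    """
    names = list(TRANSFORMS)
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    return [names[i] for i in rng.choice(len(names), size=length, p=probs)]

def apply_sequence(x: np.ndarray, sequence: list) -> np.ndarray:
    for name in sequence:
        x = TRANSFORMS[name](x)
    return x

rng = np.random.default_rng(1)
x = rng.random((32, 32))
seq = sample_policy(np.zeros(len(TRANSFORMS)), length=2, rng=rng)
x_aug = apply_sequence(x, seq)
```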
The third scheme focuses on learning visual models from raw data and pseudo-labels, without any human annotations. To select high-quality data and reduce errors in the pseudo-labels, the proposed probability-sampling algorithm combines assessments of the correctness, complexity, and diversity of data samples and their pseudo-labels. Selected samples are then expanded with "contrastive samples" that contain similar target signals but different background scenes, enabling a contrastive loss to provide additional guidance during model training.
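A minimal sketch of one way such a probability-sampling step could look: per-sample correctness, complexity, and diversity scores are combined into selection probabilities, and a subset of pseudo-labeled samples is drawn accordingly. The weighted geometric mean and all names below are assumptions, not the dissertation's exact formulation.

```python
import numpy as np

def selection_probabilities(correctness: np.ndarray,
                            complexity: np.ndarray,
                            diversity: np.ndarray,
                            weights=(1.0, 1.0, 1.0)) -> np.ndarray:
    """Combine per-sample quality scores into sampling probabilities.

    Each input is an array of scores in [0, 1] for the pseudo-labeled
    samples; the weighted geometric mean used here is just one plausible
    way to combine them.
    """
    eps = 1e-8
    w_cor, w_cpx, w_div = weights
    score = ((correctness + eps) ** w_cor
             * (complexity + eps) ** w_cpx
             * (diversity + eps) ** w_div)
    return score / score.sum()

rng = np.random.default_rng(2)
n = 1000
probs = selection_probabilities(rng.random(n), rng.random(n), rng.random(n))
# Draw a subset of pseudo-labeled samples in proportion to their scores.
selected = rng.choice(n, size=200, replace=False, p=probs)
```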
The proposed methods were evaluated on multiple tasks, including whistle extraction, image classification, and scene text recognition. Extensive experiments showed that these methods achieved state-of-the-art performance on public benchmarks.