eScholarship
Open Access Publications from the University of California

UC Santa Cruz

UC Santa Cruz Electronic Theses and Dissertations

Embracing Data-Centric AI: Practical and Provable Solutions to Weakly Supervised Data

Creative Commons 'BY-NC' version 4.0 license
Abstract

Machine learning is a garbage-in, garbage-out system: models are only as good as the labeled data used to train them. In real-world scenarios, however, data quality issues are prevalent and lead to poor model performance and undesirable outcomes. Weakly supervised learning has emerged as a promising way to address this issue, enabling artificial intelligence (AI) systems to learn from noisy or unlabeled data. In this dissertation, we take a data-centric view of AI and provide practical and provable solutions for handling weakly supervised data. In particular, we introduce a pipeline of three procedures for handling data issues in weakly supervised learning: 1) a data diagnosis algorithm that learns the noise rates when true labels are missing, 2) a data curation algorithm that detects and fixes corrupted labels, and 3) robust learning algorithms that train on the curated data. We also discuss a multi-dimensional evaluation of model performance that goes beyond accuracy when the data is imperfect. All of the work described above has been open-sourced. The data diagnosis and curation pipeline is available at https://github.com/Docta-ai/docta.
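
As a rough illustration of what the diagnosis and curation steps involve, the sketch below flags a label as suspect when it disagrees with the majority label of its nearest neighbors in feature space, then reports the flagged fraction as a crude noise-rate estimate. This is not the dissertation's algorithm and not the Docta API. The function names (`knn_label_vote`, `curate_labels`, `estimate_noise_rate`) are hypothetical, and the approach assumes that nearby points tend to share the same true label and that labels are integers in `0..C-1`.

```python
import numpy as np


def knn_label_vote(features, noisy_labels, k=3):
    """Majority label among each point's k nearest neighbors (self excluded).

    Hypothetical helper for illustration; assumes integer labels 0..C-1.
    """
    features = np.asarray(features, dtype=float)
    noisy_labels = np.asarray(noisy_labels)
    # Pairwise squared Euclidean distances, shape (n, n).
    diffs = features[:, None, :] - features[None, :, :]
    dists = np.sum(diffs ** 2, axis=-1)
    np.fill_diagonal(dists, np.inf)  # a point is never its own neighbor
    neighbor_idx = np.argsort(dists, axis=1)[:, :k]
    neighbor_labels = noisy_labels[neighbor_idx]  # shape (n, k)
    num_classes = int(noisy_labels.max()) + 1
    # votes[i, c] = number of neighbors of point i that carry label c.
    votes = np.stack(
        [(neighbor_labels == c).sum(axis=1) for c in range(num_classes)],
        axis=1,
    )
    return votes.argmax(axis=1)


def curate_labels(features, noisy_labels, k=3):
    """Flag labels that disagree with the neighbor vote and replace them."""
    noisy_labels = np.asarray(noisy_labels)
    voted = knn_label_vote(features, noisy_labels, k)
    suspect = voted != noisy_labels
    curated = np.where(suspect, voted, noisy_labels)
    return suspect, curated


def estimate_noise_rate(suspect):
    """Crude overall noise-rate estimate: the fraction of flagged labels."""
    return float(np.mean(suspect))
```

Calling `curate_labels(features, noisy_labels)` returns a boolean mask of suspect labels together with the corrected label vector. A real pipeline would typically use learned embeddings rather than raw features and would estimate per-class noise rates rather than a single fraction.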
