Embracing Data-Centric AI: Practical and Provable Solutions to Weakly Supervised Data
- Zhu, Zhaowei
- Advisor(s): Liu, Yang YL
Abstract
Machine learning is a garbage-in, garbage-out system: training a good model depends on high-quality labeled data. In real-world scenarios, however, data quality issues are prevalent and lead to poor model performance and undesirable outcomes. Weakly supervised learning has emerged as a promising way to address this problem, enabling artificial intelligence (AI) systems to learn from noisy or unlabeled data. In this dissertation, we delve into data-centric AI and provide practical and provable solutions for handling weakly supervised data. In particular, we introduce a pipeline of three procedures for handling data issues in weakly supervised learning: 1) a data diagnosis algorithm that learns the noise rates when true labels are missing, 2) a data curation algorithm that detects and fixes corrupted labels, and 3) robust learning algorithms that train on the curated data. We also discuss a multi-dimensional evaluation of model performance that goes beyond accuracy when the data is imperfect. All of the work described above has been open-sourced. The data diagnosis and curation pipeline is available at https://github.com/Docta-ai/docta.
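To make the idea of detecting and fixing corrupted labels without access to true labels more concrete, the following is a minimal sketch of one common heuristic: flag a sample when its label disagrees with the majority label of its nearest neighbors in feature space. This is an illustrative assumption, not the dissertation's algorithm and not the Docta API; the function name `flag_suspicious_labels`, the parameter `k`, and the toy data are all hypothetical.

```python
from collections import Counter


def flag_suspicious_labels(features, noisy_labels, k=5):
    """Return (index, suggested_label) pairs for samples whose label
    disagrees with the majority label of their k nearest neighbors.

    Hypothetical helper for illustration only; not the Docta API.
    """
    flagged = []
    for i, x in enumerate(features):
        # Squared Euclidean distance from sample i to every other sample.
        dists = sorted(
            (sum((a - b) ** 2 for a, b in zip(x, y)), j)
            for j, y in enumerate(features)
            if j != i
        )
        neighbor_labels = [noisy_labels[j] for _, j in dists[:k]]
        majority, _ = Counter(neighbor_labels).most_common(1)[0]
        if majority != noisy_labels[i]:
            flagged.append((i, majority))
    return flagged


# Toy data (assumed): two well-separated clusters on a line.
features = [(i, 0) for i in range(10)] + [(100 + i, 0) for i in range(10)]
noisy_labels = [0] * 10 + [1] * 10
noisy_labels[3] = 1   # simulate a corrupted label in the first cluster
noisy_labels[14] = 0  # simulate a corrupted label in the second cluster

flagged = flag_suspicious_labels(features, noisy_labels, k=5)

# Curation step: replace each flagged label with the neighbor majority.
curated_labels = list(noisy_labels)
for index, suggested in flagged:
    curated_labels[index] = suggested

# Crude diagnosis: fraction of samples flagged as likely mislabeled.
estimated_noise_rate = len(flagged) / len(noisy_labels)
```

On this toy data the heuristic recovers both injected errors, and the fraction of flagged samples serves as a rough estimate of the overall noise rate. Real data would need learned feature representations and a more careful estimator than this simple vote.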