Deep learning has revolutionized fields like computer vision, natural language processing, and multimodal learning, but its reliance on large datasets brings challenges such as rising computational costs, vulnerability to data poisoning attacks, and difficulty achieving robustness against spurious correlations.
My research addresses these challenges through a data-centric approach that improves data selection, curriculum design, and weighting strategies. This dissertation is organized into three parts. First, for efficient training, CREST identifies coresets for deep vision models with theoretical guarantees, and S2L reduces fine-tuning costs for large language models by prioritizing data subsets based on the loss trajectories of a small proxy model. Second, for robust training against data poisoning, EPIC iteratively detects and excludes malicious examples during training, effectively mitigating such attacks. Finally, to address spurious correlations, SPARE mitigates these biases early in training by separating and rebalancing biased groups, PDE progressively expands balanced subsets to guide models toward learning core features, and a multimodal fine-tuning method improves the robustness of vision-language models such as CLIP by reducing their reliance on spurious features, achieving significant gains in worst-group accuracy.
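To make the loss-trajectory-based selection idea concrete, the following is a minimal sketch, assuming a simple setup in which each example's loss is recorded at several checkpoints of a small proxy model, examples are clustered by these trajectories, and a roughly balanced subset is drawn across clusters. The function name, clustering choice, and parameters (budget, n_clusters) are illustrative assumptions, not the exact procedure used in S2L.

```python
import numpy as np
from sklearn.cluster import KMeans

def select_subset(loss_trajectories: np.ndarray, budget: int,
                  n_clusters: int = 50, seed: int = 0) -> np.ndarray:
    """Illustrative sketch: cluster examples by their proxy-model loss
    trajectories, then sample roughly evenly across clusters so that rare
    learning behaviors remain represented in the fine-tuning subset.

    loss_trajectories[i, t] is the loss of example i at proxy checkpoint t.
    """
    rng = np.random.default_rng(seed)
    # Group examples with similar loss trajectories (assumed k-means clustering).
    labels = KMeans(n_clusters=n_clusters, random_state=seed, n_init=10).fit_predict(
        loss_trajectories
    )
    per_cluster = budget // n_clusters
    selected = []
    for c in range(n_clusters):
        members = np.flatnonzero(labels == c)
        take = min(per_cluster, len(members))
        # Draw an equal share from each cluster to balance the subset.
        selected.extend(rng.choice(members, size=take, replace=False))
    return np.asarray(selected)

# Example usage with synthetic trajectories (10,000 examples, 20 checkpoints):
# subset_indices = select_subset(np.random.rand(10_000, 20), budget=1_000)
```

The balanced draw across trajectory clusters is what keeps small but informative groups of examples from being crowded out by the dominant, easy-to-learn majority.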
Together, my research demonstrates how focusing on the properties and selection of data helps address core limitations in deep learning, providing scalable and effective solutions that bridge theoretical insights with practical needs across diverse real-world applications.