Pre-trained models have demonstrated remarkable capabilities in language understanding and generation, opening new possibilities in healthcare. They show promise in mining scientific literature, analyzing large-scale healthcare data, identifying patterns in emerging diseases, and automating clinical workflows—essentially functioning as research assistants. However, general-purpose pre-trained models—typically trained on web-scale corpora—lack the clinical grounding needed for reliable deployment in healthcare. To be effective, these models must be optimized for domain-specific needs. This thesis addresses three core challenges in adapting and utilizing pre-trained models for healthcare: (i) the lack of sufficient data for fine-tuning, (ii) evolving healthcare data, and (iii) the need to ensure transparency and traceability of AI-generated content.
To address data scarcity, we propose a three-level optimization framework that fine-tunes a pre-trained model to generate high-quality synthetic data for a target task with limited data. The framework begins by adapting the pre-trained model to a related, abundant dataset, assigning a learnable weight to each training sample. These weights are iteratively updated based on feedback from a separate downstream model trained on the generated data, enabling the framework to upweight samples that contribute more to downstream performance. This feedback loop in turn improves the fine-tuning of the pre-trained model, leading to the generation of data that enhances downstream task performance. We demonstrate the effectiveness of this approach on a long-COVID article classification task—a challenging low-resource setting.
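The sample-reweighting step can be illustrated with a minimal sketch. Here the per-sample feedback signal (e.g., each sample's estimated contribution to reducing downstream validation loss) and the exponentiated multiplicative update rule are illustrative assumptions; the thesis learns the weights through the full three-level optimization rather than this one-shot rule.

```python
import numpy as np

def update_sample_weights(weights, feedback, lr=0.5):
    """Upweight training samples with positive downstream feedback.

    feedback[i] is a hypothetical per-sample signal: positive if sample i
    helped the downstream model, negative otherwise. An exponentiated
    update keeps weights positive; we renormalize to mean 1 so the
    effective training-set size is unchanged.
    """
    w = weights * np.exp(lr * feedback)
    return w / w.sum() * len(w)

# Toy demo: four source samples, one round of downstream feedback.
weights = np.ones(4)
feedback = np.array([0.8, -0.2, 0.1, -0.5])
new_w = update_sample_weights(weights, feedback)
```

In the full framework this update is driven by gradients through the downstream model's validation loss, so helpful samples are upweighted across iterations rather than in a single step.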
For the second challenge—adapting to evolving healthcare data—we propose a bi-level optimization framework that fine-tunes a model on new data by updating only a sparse subset of parameters selected for task-specific adaptation. The rest of the model is regularized to remain close to values learned from previously seen sources, helping to mitigate forgetting. To identify which parameters to update, we assign a learnable score to each one and jointly optimize these scores and their corresponding weights in a two-stage process. We impose a sparsity constraint on the scores to ensure that large updates are limited to a small subset of parameters. We evaluate this framework on an early sepsis prediction task using patient data from four real-world hospitals.
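A simplified sketch of the masked-update idea follows. For illustration, the scores are treated as given and the sparsity constraint is realized as a hard top-k selection; in the thesis the scores themselves are learned jointly with the weights via bi-level optimization.

```python
import numpy as np

def sparse_update(theta, theta_old, grad, scores, k, lr=0.1, reg=0.01):
    """Take a gradient step only on the k highest-scoring parameters;
    pull the remaining parameters back toward previously learned values
    theta_old, mitigating forgetting of earlier data sources."""
    mask = np.zeros_like(theta, dtype=bool)
    mask[np.argsort(scores)[-k:]] = True              # top-k by learnable score
    theta = theta - lr * grad * mask                  # task-specific adaptation
    theta = np.where(mask, theta,
                     theta - reg * (theta - theta_old))  # regularize the rest
    return theta, mask

# Toy demo: four parameters, only the two highest-scoring ones adapt.
theta_old = np.zeros(4)                # values from previously seen sources
theta = np.ones(4)                     # current parameters
scores = np.array([0.9, 0.1, 0.8, 0.2])
theta_new, mask = sparse_update(theta, theta_old, grad=np.ones(4),
                                scores=scores, k=2)
```

The hard top-k mask stands in for the sparsity constraint on the scores; a soft, differentiable relaxation would be needed to optimize the scores by gradient descent.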
To enable traceability of AI-generated content, we propose a watermarking algorithm applied at inference time that perturbs the model’s logits to bias generation toward a subset of vocabulary tokens determined by a secret key. To ensure this biasing does not degrade generation quality, we introduce a multi-objective optimization framework that jointly learns how many tokens to bias and by how much—balancing watermark detectability with semantic coherence. The approach improves detectability, preserves text quality, and enhances robustness against a range of watermark removal attacks compared to prior methods.
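The logit-biasing mechanism can be sketched as follows. The hash-based seeding, the green-list fraction `gamma`, and the bias strength `delta` are illustrative fixed choices here; the contribution of the thesis is to learn how many tokens to bias and by how much via multi-objective optimization, rather than fixing them as below.

```python
import hashlib
import numpy as np

def watermark_logits(logits, prev_token, key, gamma=0.5, delta=2.0):
    """Bias next-token logits toward a pseudo-random 'green' subset of
    the vocabulary, seeded by a secret key and the previous token.

    gamma: fraction of the vocabulary placed on the green list.
    delta: additive bias applied to green-token logits.
    A detector holding the key can recount green tokens in a text and
    flag an improbably high fraction as watermarked.
    """
    vocab = len(logits)
    seed = int(hashlib.sha256(f"{key}:{prev_token}".encode()).hexdigest(),
               16) % 2**32
    rng = np.random.default_rng(seed)
    green = rng.choice(vocab, size=int(gamma * vocab), replace=False)
    out = logits.copy()
    out[green] += delta
    return out, set(green.tolist())

# Toy demo: uniform logits over a 10-token vocabulary.
biased, green = watermark_logits(np.zeros(10), prev_token=3, key="demo-key")
```

Because the green list depends only on the key and the context token, detection requires no access to the model itself, only to the secret key.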
Together, these contributions offer a principled framework for adapting and securely utilizing pre-trained models in real-world healthcare settings.