Text classification is one of the most fundamental tasks in natural language processing. A central challenge is how to effectively utilize unlabeled data and apply weakly supervised learning methods to improve performance beyond what the existing labeled data alone provides, especially for supervision-starved tasks where high-quality labeled data is hard to obtain. In this PhD thesis, we present several studies of weakly supervised learning methods for text classification.
We first focus on improving accuracy and interpretability in text classification tasks using weakly supervised learning methods with the help of unlabeled data. More specifically, we propose several new methods that improve accuracy and interpretability along both of the two main research directions in weakly supervised learning: learning with noisy labels and semi-supervised learning. For learning with noisy labels, we propose two weakly supervised learning aided methods for a particular supervision-starved text classification task, Research Replication Prediction. For semi-supervised learning, we present a new weakly supervised interpretable model that improves interpretability on long text classification tasks. We also propose a new ensemble method that assigns better pseudo or noisy labels to the samples in the unlabeled dataset for semi-supervised learning methods.
Furthermore, we conduct research on fairness in weakly supervised learning. More specifically, we reveal the disparate impacts on different sub-populations (e.g., by race and gender) of applying semi-supervised learning methods. Finally, we contribute a weakly supervised learning benchmark, Research Replication Prediction, to the community.