Large language models have advanced the state of the art in natural language processing and achieved success in tasks such as summarization, question answering, and text classification. However, these models are trained on large-scale datasets, which may include harmful information. Studies have shown that, as a result, trained models can exhibit social biases and generate misinformation. This dissertation discusses research on analyzing and interpreting the risks of large language models across the areas of fairness, trustworthiness, and safety.
The first part of this dissertation analyzes fairness issues related to social biases in large language models. We first investigate dialect bias between African American English and Standard American English in the context of text generation. We then analyze a more complex fairness setting in which multiple attributes interact to form compound biases, studied with respect to gender and seniority.
The second part focuses on trustworthiness and the spread of misinformation across three scopes: prevention, detection, and memorization. We describe an open-domain question-answering system for emergent domains that uses retrieval and re-ranking techniques to provide users with information from trustworthy sources, demonstrated in the context of the emergent COVID-19 pandemic. We then work toward detecting potential online misinformation by creating a large-scale dataset that extends misinformation detection into the multimodal space of images and text. Because misinformation can be both human-written and machine-written, we also investigate the memorization and subsequent generation of misinformation through the lens of conspiracy theories.
The final part of the dissertation describes recent work in AI safety concerning text that may lead to physical harm. This research analyzes covertly unsafe text across various language modeling tasks, including generation, reasoning, and detection.
Altogether, this work sheds light on previously undiscovered and underrepresented risks in large language models, advancing research toward safer and more equitable natural language processing systems. We conclude with a discussion of future research in Responsible AI that expands upon work in these three areas.