eScholarship
Open Access Publications from the University of California

UC San Diego

UC San Diego Electronic Theses and Dissertations

Auditing and Mitigating Safety Risks in Large Language Models

Abstract

Large language models (LLMs; e.g., GPT-3, OPT, TNLG) have been shown to achieve remarkably high performance on standard benchmarks, owing to their high parameter counts, extremely large training datasets, and significant compute. Although the high parameter count makes these models more expressive, it can also lead to greater memorization, which, coupled with large, unvetted, web-scraped datasets, can cause a range of negative societal and ethical impacts: leakage of private, sensitive information, generation of biased, hateful, or stereotypical text, and more. For example, it has been shown that, given the appropriate prompt, full training set instances (including people's names and phone numbers) can be extracted from GPT-2. One might argue that since scraped data is public data, this is not a concern; however, public data is not necessarily publicly intended data, and leaked information may already exist online. Further, LLMs have been shown to perform disparately and unfairly for different speakers and subjects of speech depending on their attributes (religion, gender identity, ethnicity, etc.) and to display bias against certain subgroups; for instance, dialogue models have been shown to exhibit persona biases, attributing jobs to people based on their demographics. In this dissertation, we strive to address such problems through work falling into three main categories. First, we discuss privacy auditing: discovering the different ways in which language models can leak information and analyzing memorization in such models so that possible ramifications can be better prevented (Chapter 1). Second, we propose mechanisms for protecting privacy, addressing existing threats and preventing the leakage of private training data from models (Chapter 2). Finally, we propose attribute-controlled text revision and generation methods to address biased and stereotypical generations, help conceal demographic attributes, and address fairness concerns (Chapter 3).
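To illustrate the kind of memorization probe the abstract alludes to (this is a minimal sketch, not the dissertation's own auditing methodology), the following Python snippet prompts GPT-2 through the Hugging Face transformers library with a prefix and checks whether the greedy continuation reproduces a candidate string verbatim. The model choice, the prompt, and the target string are illustrative assumptions.

from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

def greedy_continuation(prefix: str, max_new_tokens: int = 32) -> str:
    # Greedy decoding returns the model's highest-likelihood continuation,
    # which is where verbatim memorized text tends to surface.
    inputs = tokenizer(prefix, return_tensors="pt")
    output_ids = model.generate(
        **inputs,
        max_new_tokens=max_new_tokens,
        do_sample=False,
        pad_token_id=tokenizer.eos_token_id,
    )
    # Strip the prompt tokens and decode only the generated continuation.
    return tokenizer.decode(output_ids[0][inputs["input_ids"].shape[1]:])

# Hypothetical audit query: does the model regurgitate a string suspected to
# appear in its training data? The prompt and target below are made up.
prefix = "You can reach John Doe by phone at"
candidate = "555-0123"
completion = greedy_continuation(prefix)
print("continuation:", completion)
print("candidate string reproduced verbatim:", candidate in completion)

A real audit of this flavor would sweep many prefixes drawn from suspected training sources and measure how often continuations match them verbatim, rather than testing a single handcrafted example.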
