The large-scale deployment of chat-based large language models (LLMs) requires careful evaluation to ensure these systems operate inclusively across diverse sociocultural contexts. Prior research has found that AI-driven systems can replicate and amplify existing social inequalities, for example by assuming that a person who uses the pronoun "she" is less likely to be a doctor and more likely to be a homemaker. Historically marginalized communities, such as transgender and non-binary (TGNB) individuals, are particularly susceptible to these harms, as algorithmic systems often fail to represent identities that diverge from binary gender conventions.
This dissertation demonstrates the interdependence of technical and social considerations in the development of inclusive language models. In the first part, we systematically investigate the representational harms LLMs can inflict on TGNB identities. We introduce TANGO, a benchmark dataset designed to evaluate gender-inclusive competencies such as pronoun congruence and gender disclosure. Our findings reveal high misgendering rates and severe data-resource limitations that lead to poor handling of gender-diverse pronouns. To address these challenges, we propose novel mitigation techniques centered on tokenization and low-resource methods, yielding significant improvements in LLM gender inclusivity.
In the second part, we uncover fundamental limitations within existing gender bias evaluation frameworks, highlighting the sociotechnical consequences of limited construct validity. Through contextually grounded evaluations based on lived TGNB experiences, we demonstrate that even LLMs explicitly aligned for safety can propagate harmful biases that go undetected by conventional evaluation frameworks. By involving the TGNB community in dataset creation and evaluation, we show how participatory methods can ensure that marginalized voices guide the development of more inclusive AI systems. Finally, we present SLOGAN, a framework for detecting local biases in clinical prediction tasks, illustrating how these contextually grounded techniques can address biases across domains. Together, these findings highlight promising directions for tackling LLM harms through community-informed technical and systemic mitigation strategies.