A central aspect of natural language understanding is linking information across multiple sentences and even combining multiple sources (for example, articles, conversations, blogs, and tweets). Understanding this global information structure requires identifying the people, objects, and events as they evolve over a discourse. While natural language processing (NLP) has made great progress on sentence-level tasks such as parsing and machine translation, far less progress has been made on processing and understanding larger units of language such as a document or a conversation.
The initial step in understanding discourse structure is to recognize the entities (people, artifacts, locations, and organizations) being discussed and track their references throughout. Entities are referred to in many ways: with proper names ("Barack Obama"), nominal descriptions ("the president"), and pronouns ("he" or "him"). Entity reference resolution is the task of deciding to which entity a textual mention refers.
Entity reference resolution is influenced by a variety of constraints, including syntactic, discourse, and semantic constraints. Even some of the earliest work (Hobbs, 1977, 1979) recognized that while syntactic and discourse constraints can be declaratively specified, semantic constraints are more elusive. While past work has successfully learned many of the syntactic and discourse cues, there has yet to be an entity reference resolution system that exploits semantic cues and operationalizes these observations in a coherent model.
This dissertation presents unified statistical models for entity reference resolution that can be learned in an unsupervised way (without labeled data) and that model soft semantic constraints probabilistically alongside hard grammatical constraints. While the linguistic insights that underlie these models were observed in some of the earliest anaphora resolution literature (Hobbs, 1977, 1979), the machine learning techniques that allow these cues to be used collectively and effectively are relatively recent (Blei et al., 2003; Teh et al., 2006; Blei and Frazier, 2009). In particular, our models use recent insights into Bayesian non-parametric modeling (Teh et al., 2006) to learn entity partition structure effectively when the number of entities is not known ahead of time. The primary contribution of this dissertation is combining the linguistic observations of past researchers with modern structured machine learning techniques. The models presented herein yield state-of-the-art reference resolution results compared with other systems, whether supervised or unsupervised.
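To give some intuition for how a Bayesian non-parametric prior can partition mentions into entities without fixing the number of entities in advance, the following minimal sketch samples a partition from a Chinese restaurant process, a standard construction underlying Dirichlet process mixture models. It is an illustration only and not the model developed in this dissertation: the function name `sample_crp_partition`, the concentration parameter `alpha`, and the use of plain mention counts (with no linguistic features) are all assumptions made for the example.

```python
import random


def sample_crp_partition(num_mentions, alpha, seed=0):
    """Sample a partition of mentions into entities from a Chinese
    restaurant process with concentration parameter alpha (> 0).

    Hypothetical illustration: mentions carry no features here, so only
    the prior over partitions is shown, not a full resolution model.
    """
    rng = random.Random(seed)
    assignments = []   # assignments[i] is the entity index of mention i
    entity_sizes = []  # entity_sizes[k] is the number of mentions in entity k

    for _ in range(num_mentions):
        # An existing entity k is chosen with weight entity_sizes[k];
        # a brand-new entity is chosen with weight alpha.
        weights = entity_sizes + [alpha]
        k = rng.choices(range(len(weights)), weights=weights)[0]
        if k == len(entity_sizes):
            entity_sizes.append(1)   # open a new entity
        else:
            entity_sizes[k] += 1     # join an existing entity
        assignments.append(k)

    return assignments, entity_sizes


if __name__ == "__main__":
    assignments, entity_sizes = sample_crp_partition(20, alpha=1.0, seed=1)
    print("entity index per mention:", assignments)
    print("number of entities:", len(entity_sizes))
```

Under this prior the number of entities is not a fixed parameter: it grows as more mentions are observed, and larger values of `alpha` tend to favor opening new entities over joining existing ones.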