eScholarship
Open Access Publications from the University of California

Bootstrapping Syntactic Categories

Abstract

In learning the structure of a new domain, it appears necessary to simultaneously discover an appropriate set of categories and a set of rules defined over them. We show how this bootstrapping problem may be solved in the case of learning syntactic categories, without making assumptions about the nature of linguistic rules. Each word is described by a vector of bigram statistics, which describe the distribution of local contexts in which it occurs; cluster analysis with respect to an appropriate similarity metric groups together words with similar distributions of contexts. Using large noisy untagged corpora of English, the resulting clusters are in good agreement with a standard linguistic analysis. A similar method is also applied to classify short sequences of words into phrasal syntactic categories. This statistical approach can be straightforwardly realised in a neural network, which finds syntactically interesting categories from real text, whereas the principal alternative network approach is limited to finding the categories in small artificial grammars. The general strategy, using simple statistics to find interesting categories without assumptions about the nature of the irrelevant rules defined over those categories, may be applicable to other domains.
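
The pipeline described in the abstract (bigram context vectors followed by cluster analysis) can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not name the similarity metric or the clustering procedure, so cosine similarity and average-linkage agglomerative clustering are assumptions made here, and the toy corpus, the function names `context_vectors`, `cosine` and `cluster`, and the choice of target and context words are all hypothetical.

```python
import math
from itertools import combinations


def context_vectors(tokens, targets, contexts):
    """Describe each target word by counts of its left and right neighbours."""
    index = {c: i for i, c in enumerate(contexts)}
    n = len(contexts)
    # First n slots: word immediately before; last n slots: word immediately after.
    vectors = {t: [0] * (2 * n) for t in targets}
    for prev, word in zip(tokens, tokens[1:]):
        if word in vectors and prev in index:
            vectors[word][index[prev]] += 1
        if prev in vectors and word in index:
            vectors[prev][n + index[word]] += 1
    return vectors


def cosine(u, v):
    """Cosine similarity (an assumed metric, not taken from the paper)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def cluster(vectors, k):
    """Average-linkage agglomerative clustering until k clusters remain."""
    clusters = [[w] for w in sorted(vectors)]

    def linkage(a, b):
        total = sum(cosine(vectors[x], vectors[y]) for x in a for y in b)
        return total / (len(a) * len(b))

    while len(clusters) > k:
        # Merge the pair of clusters with the highest average similarity.
        i, j = max(
            combinations(range(len(clusters)), 2),
            key=lambda p: linkage(clusters[p[0]], clusters[p[1]]),
        )
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


# Hypothetical toy corpus; the paper uses large untagged English corpora.
corpus = (
    "the cat sleeps . the dog sleeps . a cat runs . "
    "a dog runs . the dog runs . a cat sleeps ."
)
tokens = corpus.split()
targets = ["cat", "dog", "runs", "sleeps"]
contexts = ["the", "a", "cat", "dog", "runs", "sleeps", "."]

vectors = context_vectors(tokens, targets, contexts)
groups = cluster(vectors, 2)
print(groups)  # [['cat', 'dog'], ['runs', 'sleeps']]
```

On this toy input the two nouns end up in one cluster and the two verbs in the other, purely because they share neighbouring words; no grammatical rules are supplied. On real text the vectors would be built over a much larger vocabulary of context words, and the choice of metric would matter far more than it does here.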
