Research in natural language processing (NLP) aims to make machines automatically understand natural language. Incorporating world knowledge semantics is becoming an increasingly crucial requirement for resolving deep, complex decisions in most NLP tasks today, e.g., question answering, syntactic parsing, coreference resolution, and relation extraction. Structured NLP corpora such as treebanks are too small to encode much of this knowledge, so instead we turn to the vast Web and access its information via a diverse collection of Web n-gram counts (roughly 4 billion n-grams, about 500 times larger than Wikipedia). Shallow cues from this large n-gram dataset, when harnessed in a structured learning setting, help reveal deep semantics.
In this thesis, we address several important facets of the semantics problem: indirect semantics for sentence-level syntactic ambiguities, semantics as specific knowledge for discourse-level coreference ambiguities, structured acquisition of semantic taxonomies from text, and fine-grained semantics such as intensity order. These facets represent structured NLP tasks with combinatorially large decision spaces. Hence, in general, we adopt a structured learning approach, incorporating surface Web-based semantic cues as intuitive features over the full space of decisions. The feature weights are then learned automatically via discriminative training. Empirically, for each facet, we see significant improvements over the corresponding state of the art.
In the first part of this thesis, we show how Web-based features can be powerful cues for resolving complex syntactic ambiguities. We develop surface n-gram features over the full range of syntactic attachments, encoding both lexical affinities and paraphrase-based cues to syntactic structure. When incorporated into full-scale, discriminative dependency and constituent parsers, these features correct a range of error types.
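To illustrate the flavor of such features, the following is a minimal sketch (in Python) of a count-based lexical-affinity cue for a candidate head-modifier attachment. The `ngram_count` lookup, the log-scale binning, and the exact feature strings are illustrative assumptions rather than the thesis's precise feature set.

```python
import math

def affinity_features(head, head_pos, mod, mod_pos, ngram_count):
    # Count-based affinity cue for a candidate head -> modifier attachment.
    # `ngram_count` is a hypothetical lookup from n-gram strings to Web counts.
    count = ngram_count.get(f"{head} {mod}", 0) + ngram_count.get(f"{mod} {head}", 0)
    # Discretize the raw count into log-scale bins so the parser learns
    # one weight per bin rather than per raw count value.
    bin_id = int(math.log(count, 2)) if count > 0 else -1
    # Conjoin the bin with the POS pair so different attachment contexts
    # can receive different learned weights.
    return [f"affinity:bin={bin_id}",
            f"affinity:bin={bin_id}:pos={head_pos}_{mod_pos}"]
```

Such features are added alongside the parser's standard features, and their weights are learned discriminatively together with the rest of the model.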
In the next part, we address semantic ambiguities in discourse-level coreference resolution, again using Web n-gram features that capture a range of world knowledge cues to hypernymy, semantic compatibility, and semantic context, as well as general lexical co-occurrence. When added to a state-of-the-art coreference baseline, these Web features provide significant improvements on multiple datasets and metrics.
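The sketch below illustrates the kind of world-knowledge cue involved for a pair of mention heads, combining a hypernymy-indicating "such as" pattern with a general co-occurrence count; the function name, the `ngram_count` lookup, and the crude binning are hypothetical simplifications of the features actually used.

```python
def hypernymy_features(anaphor_head, antecedent_head, ngram_count):
    # World-knowledge cues for a candidate coreference link between two
    # mention heads. `ngram_count` is a hypothetical lookup from n-gram
    # strings to Web counts.
    # Hypernymy cue: "antecedent such as anaphor", e.g., "companies such as IBM".
    hearst_count = ngram_count.get(f"{antecedent_head} such as {anaphor_head}", 0)
    # General lexical co-occurrence cue.
    cooc_count = ngram_count.get(f"{anaphor_head} {antecedent_head}", 0)
    return {
        "hearst_seen": hearst_count > 0,
        "cooc_bin": len(str(cooc_count)),  # crude order-of-magnitude bin
    }
```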
In the third part, we acquire the semantics itself via structured learning of hypernymy taxonomies. We adopt a probabilistic graphical model formulation that incorporates heterogeneous relational evidence about both hypernymy and siblinghood, captured by surface features based on patterns and statistics from Web n-grams and Wikipedia abstracts. Inference is based on loopy belief propagation and spanning-tree algorithms. The system is discriminatively trained on WordNet sub-structures using adaptive subgradient stochastic optimization. On the task of reproducing sub-hierarchies of WordNet, this approach achieves substantial error reductions over previous work.
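As a sketch of the tree-decoding step only, the snippet below extracts a highest-scoring taxonomy from precomputed edge scores (e.g., log-odds derived from belief-propagation marginals) using networkx's Chu-Liu/Edmonds arborescence routine; the `edge_scores` input and the dummy ROOT node are assumptions for illustration, not the thesis's exact interface.

```python
import networkx as nx

def decode_taxonomy(terms, edge_scores):
    # Build a complete directed graph over the terms plus a dummy ROOT,
    # weighting each potential hypernym -> hyponym edge by its score.
    g = nx.DiGraph()
    for parent in ["ROOT"] + list(terms):
        for child in terms:
            if parent != child:
                g.add_edge(parent, child,
                           weight=edge_scores.get((parent, child), 0.0))
    # Chu-Liu/Edmonds: the highest-total-weight tree spanning all terms.
    tree = nx.maximum_spanning_arborescence(g, attr="weight")
    return sorted(tree.edges())  # (hypernym, hyponym) pairs of the taxonomy
```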
Finally, we discuss a fine-grained semantic facet, intensity order, where the relative ranks of near-synonyms such as good, great, and excellent are predicted using Web statistics of phrases like good but not excellent. We employ linear programming to jointly infer the words' positions on a single intensity scale, so that individual decisions benefit from global information. When ranking English near-synonymous adjectives, this global approach achieves substantial improvements over previous work on both pairwise and rank correlation metrics.
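A minimal linear-programming sketch of this joint-ordering idea follows, under simplified assumptions: each adjective receives a position in [0, 1], and each weighted pairwise cue (e.g., counts of good but not excellent) asks the stronger word to sit above the weaker one by a margin, with weighted slack penalized when the evidence conflicts. The exact objective and constraints in the thesis differ; this only illustrates the global inference.

```python
import numpy as np
from scipy.optimize import linprog

def rank_on_scale(words, pair_evidence, margin=0.1):
    """pair_evidence: list of (weaker, stronger, weight) tuples, e.g.
    ("good", "excellent", 12.0) from counts of 'good but not excellent'."""
    n, m = len(words), len(pair_evidence)
    idx = {w: i for i, w in enumerate(words)}
    # Variables: n scale positions followed by m slack variables.
    # Objective: minimize the weighted sum of slacks (violated evidence).
    c = np.concatenate([np.zeros(n), [w for _, _, w in pair_evidence]])
    A_ub = np.zeros((m, n + m))
    b_ub = np.full(m, -margin)
    for k, (weak, strong, _) in enumerate(pair_evidence):
        # Encode x_weak - x_strong - slack_k <= -margin,
        # i.e., the stronger word should sit at least `margin` higher.
        A_ub[k, idx[weak]] = 1.0
        A_ub[k, idx[strong]] = -1.0
        A_ub[k, n + k] = -1.0
    bounds = [(0.0, 1.0)] * n + [(0.0, None)] * m
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds, method="highs")
    positions = dict(zip(words, res.x[:n]))
    # Return the words from weakest to strongest inferred intensity.
    return sorted(words, key=positions.get)
```

Because all pairwise cues enter a single program, noisy or contradictory Web evidence for one pair can be overridden by the rest of the scale, which is the benefit of global over purely pairwise inference.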