
Compositional Generalization in Distributional Models of Semantics: Transformer-based Language Models are Architecturally Advantaged

Abstract

An important aspect of language comprehension is learning and generalizing complex lexical relations. For instance, having learned that the phrase preserve cucumbers predicts vinegar and that preserve berries predicts dehydrator, one should be able to infer that the novel phrase preserve peppers is more compatible with vinegar than with dehydrator, because pepper is more similar to cucumber than to berry. We studied the ability to perform such (compositional) generalization in distributional models trained on an artificial corpus with strict semantic regularities. We found that word-encoding models failed to learn the multi-way lexical dependencies. Recurrent neural networks learned those dependencies but struggled to generalize to novel combinations. Only mini GPT-2, a minified version of the Transformer GPT-2, succeeded in both learning and generalization. Because successful generalization in our tasks requires capturing the relationship between a phrase and a word, we argue that mini GPT-2 acquired hierarchical representations that approximate phrase structure. Our results show that, compared to older models, Transformers are architecturally advantaged for compositional generalization.
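
To make the generalization test concrete, the sketch below probes whether a Transformer language model assigns higher probability to vinegar than to dehydrator after a novel phrase like preserve peppers. This is only an illustration: it uses the publicly available pretrained GPT-2 from the Hugging Face transformers library as a stand-in, whereas the paper trains its own mini GPT-2 on an artificial corpus with strict semantic regularities; the prompt wording is likewise a hypothetical example, not taken from that corpus.

```python
# Illustrative probe of compositional generalization via next-word probabilities.
# Assumption: public pretrained GPT-2 is used as a stand-in for the paper's
# "mini GPT-2" trained on an artificial corpus (not available here).
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def continuation_prob(prefix: str, continuation: str) -> float:
    """Probability the model assigns to `continuation` following `prefix`,
    multiplying over the continuation's BPE tokens."""
    ids = tokenizer.encode(prefix, return_tensors="pt")
    cont_ids = tokenizer.encode(" " + continuation)  # leading space: GPT-2 BPE convention
    prob = 1.0
    with torch.no_grad():
        for tok in cont_ids:
            logits = model(ids).logits[0, -1]          # distribution over next token
            prob *= torch.softmax(logits, dim=-1)[tok].item()
            ids = torch.cat([ids, torch.tensor([[tok]])], dim=1)
    return prob

# Novel combination: peppers was never paired with preserve during training.
# If the model generalizes compositionally, vinegar should be favored over
# dehydrator, because pepper is more similar to cucumber than to berry.
prompt = "They preserve peppers with"  # hypothetical prompt for illustration
print("vinegar:   ", continuation_prob(prompt, "vinegar"))
print("dehydrator:", continuation_prob(prompt, "dehydrator"))
```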
