An important aspect of language comprehension is learning and generalizing complex lexical relations. For instance, having learned that the phrase "preserve cucumbers" predicts vinegar and that "preserve berries" predicts dehydrator, one should be able to infer that the novel phrase "preserve peppers" is more compatible with vinegar than with dehydrator, because pepper is more similar to cucumber than to berry. We studied the ability to perform such (compositional) generalization in distributional models trained on an artificial corpus with strict semantic regularities. We found that word-encoding models failed to learn the multi-way lexical dependencies. Recurrent neural networks learned those dependencies but struggled to generalize them to novel combinations. Only mini GPT-2, a scaled-down version of the Transformer GPT-2, succeeded at both learning and generalization. Because successful generalization in our tasks requires capturing the relationship between a phrase and a word, we argue that mini GPT-2 acquired hierarchical representations that approximate phrase structure. Our results show that Transformers have an architectural advantage over older models in performing compositional generalization.
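To make the generalization setup concrete, the sketch below builds a toy artificial corpus of the kind described above: noun categories are deterministically tied to a predicted word, and one noun-verb combination is held out so that a model can only get it right by exploiting category similarity. All names here (the category labels, noun lists, and the held-out pair) are illustrative assumptions, not the actual corpus used in the study.

```python
# Hypothetical sketch of an artificial corpus with strict semantic
# regularities, used to probe compositional generalization.
# Categories, nouns, and the held-out combination are assumptions.
import random

CATEGORIES = {
    "pickling": {"nouns": ["cucumber", "pepper", "onion"], "predicts": "vinegar"},
    "drying":   {"nouns": ["berry", "apricot", "tomato"],  "predicts": "dehydrator"},
}

# Novel combination withheld from training; correct prediction at test
# time requires generalizing from the other members of its category.
HELD_OUT = {("preserve", "pepper")}


def make_sentences():
    """Build training sentences and held-out test sentences."""
    train, test = [], []
    for category in CATEGORIES.values():
        for noun in category["nouns"]:
            sentence = f"preserve {noun}s with a {category['predicts']}"
            bucket = test if ("preserve", noun) in HELD_OUT else train
            bucket.append(sentence)
    return train, test


if __name__ == "__main__":
    random.seed(0)
    train, test = make_sentences()
    random.shuffle(train)
    print("train:", train)
    print("test (novel combinations):", test)
```

A model trained only on the `train` sentences is then scored on whether it prefers the category-consistent continuation (here, vinegar) for the held-out phrase.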