Intertwining Generalization and Memorization
Over-parameterized neural models have become dominant in Natural Language Processing. Increasing the size of a neural network seems to result in improved performance across a broad range of tasks. Despite their size, these models have been shown to generalize poorly outside their training data, seemingly failing to extract the systematic generalizations that humans use to generate and interpret language. Increasingly, work has questioned whether these models are learning to generalize or to memorize, with larger-capacity models potentially just memorizing their data more and more effectively. We suggest the tradeoff between memorization and generalization may be more nuanced, with the capacity of a model shaping the kinds of generalizations it is likely to acquire. Our results on a linguistic task suggest that while all models develop generalization strategies, smaller models may arrive at a narrower distribution of strategies that generalize more robustly to novel data.