Can Generative Multimodal Models Count to Ten?
eScholarship
Open Access Publications from the University of California

Abstract

Sophisticated AI systems that can process and produce both images and text pose new challenges for assessing what those systems can do. We adapt a behavioral paradigm from developmental psychology to characterize the counting ability of a model that generates images from text. We show that the Parti model at three scales (350M, 3B, and 20B parameters) has some counting ability at each scale, with a significant jump in performance between the 350M and 3B scales. We also demonstrate that it is possible to interfere with these models' counting ability simply by adding unusual descriptive adjectives for the objects being counted to the text prompt. We analyze our results in the context of the knower-level theory of child number learning. Our results show that a rich literature of behavioral experiments on humans can provide experimental intuition for how to probe model behavior. Perhaps most importantly, by adapting human developmental benchmarking paradigms to AI models, we can characterize and understand their behavior with respect to our own.
