Computational models have long been used in Cognitive Science, but to date most research has used language models trained on text alone. With recent advances in Computer Vision, research is expanding to visually informed models. In this paper, we explore the potential of such models to account for human naming behavior as recorded in naming norms, where subjects are asked to name visually presented objects. We compare the performance of three representative models on a set of norms whose stimuli include line drawings, colored drawings, and realistic photographs. The state-of-the-art Language and Vision model CLIP, trained on both text and images, performs best: it generalizes well across the different types of stimuli and achieves good overall accuracy. CLIP affords both linguistic (text-based) and visual (image-based) representations for names, and we find that the textual representations outperform the visual ones. This is good news, as textual representations are easier to obtain. All in all, our results show promise for the use of Computer Vision and Language and Vision models in Cognitive Science.
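
To make the contrast between CLIP's two kinds of name representations concrete, the following is a minimal sketch of how one might score candidate names for a stimulus image. The abstract does not specify the paper's exact pipeline, so the model checkpoint, the file paths, the candidate names, and the exemplar-averaging scheme for visual representations are all illustrative assumptions, not the authors' method.

```python
# Hypothetical sketch: scoring candidate names for a stimulus with CLIP,
# using either text-based or image-based name representations.
# Checkpoint, paths, names, and exemplar averaging are assumptions.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
model.eval()

candidate_names = ["dog", "puppy", "animal"]  # hypothetical responses from a norm
stimulus = Image.open("stimulus.png")         # hypothetical stimulus image

with torch.no_grad():
    # Embed the stimulus to be named and L2-normalize.
    stim_inputs = processor(images=stimulus, return_tensors="pt")
    stim_emb = model.get_image_features(**stim_inputs)
    stim_emb = stim_emb / stim_emb.norm(dim=-1, keepdim=True)

    # (a) Text-based name representations: embed each name as a string.
    text_inputs = processor(text=candidate_names, return_tensors="pt", padding=True)
    text_embs = model.get_text_features(**text_inputs)
    text_embs = text_embs / text_embs.norm(dim=-1, keepdim=True)

    # (b) Image-based name representations: average the embeddings of a few
    # exemplar images per name (hypothetical files, one possible scheme).
    exemplar_paths = {
        "dog": ["dog1.png", "dog2.png"],
        "puppy": ["puppy1.png"],
        "animal": ["animal1.png"],
    }
    visual_embs = []
    for name in candidate_names:
        imgs = [Image.open(p) for p in exemplar_paths[name]]
        inputs = processor(images=imgs, return_tensors="pt")
        embs = model.get_image_features(**inputs)
        embs = embs / embs.norm(dim=-1, keepdim=True)
        visual_embs.append(embs.mean(dim=0))
    visual_embs = torch.stack(visual_embs)
    visual_embs = visual_embs / visual_embs.norm(dim=-1, keepdim=True)

# Cosine similarity between the stimulus and each name representation;
# the best-scoring name is the model's predicted naming response.
text_scores = (stim_emb @ text_embs.T).squeeze(0)
visual_scores = (stim_emb @ visual_embs.T).squeeze(0)
print("text-based prediction:", candidate_names[text_scores.argmax().item()])
print("image-based prediction:", candidate_names[visual_scores.argmax().item()])
```

Note the practical asymmetry this sketch makes visible: a text-based representation needs only the name itself, while an image-based representation requires collecting exemplar images for every candidate name, which is why the finding that textual representations perform better is convenient.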