From a quick glance or the touch of an object, our brains map sensory signals to scenes composed of rich and detailed shapes and surfaces. Unlike standard approaches to perception, we argue that this mapping draws on internal causal and compositional models of the physical world, and that these internal models underlie the generalization capacity of human perception. Here, we present a generative model of visual and multisensory perception in which the latent variables encode intrinsic (e.g., shape) and extrinsic (e.g., occlusion) object properties. The latent variables are inputs to causal models that output sense-specific signals. We present a recognition network that performs efficient inference in the generative model, computing at a speed comparable to online perception. We show that our model, but not the alternatives, can account for human performance in an occluded face matching task and in a visual-to-haptic face matching task.
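The architecture described above can be illustrated with a toy sketch. This is not the paper's model: the shape latents, the `render_visual` and `render_haptic` functions, and the nearest-candidate `recognize` step are all hypothetical stand-ins, chosen only to show how shared latent variables feed sense-specific causal renderers and how recognition inverts them.

```python
# Toy sketch (NOT the paper's model): intrinsic (shape) and extrinsic
# (occlusion) latents feed hypothetical sense-specific renderers; a toy
# recognition step inverts the renderer by matching candidates.
import numpy as np

rng = np.random.default_rng(0)

def render_visual(shape, occlusion):
    """Hypothetical visual renderer: occlusion masks part of the signal."""
    img = shape.copy()
    img[:occlusion] = 0.0          # occluded region carries no signal
    return img

def render_haptic(shape, occlusion):
    """Hypothetical haptic renderer: touch is unaffected by visual occlusion."""
    return shape * 0.9             # a different transform of the same latents

def recognize(signal, candidates, renderer, occlusion):
    """Toy recognition: pick the candidate latent that best explains the signal."""
    errors = [np.sum((renderer(c, occlusion) - signal) ** 2) for c in candidates]
    return int(np.argmin(errors))

# One shared shape latent rendered to two senses, then matched across senses.
shapes = [rng.normal(size=16) for _ in range(3)]
true_idx, occ = 1, 4
visual = render_visual(shapes[true_idx], occ)
haptic = render_haptic(shapes[true_idx], occ)

print(recognize(visual, shapes, render_visual, occ))  # matches despite occlusion
print(recognize(haptic, shapes, render_haptic, occ))  # same latent found via touch
```

Because both renderers share the same latent shape, the same candidate explains both the occluded visual signal and the haptic signal, which is the cross-modal matching structure the tasks in the abstract probe.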