A glance at an object is often sufficient to recognize it and
recover fine details of its shape and appearance, even under
highly variable viewpoint and lighting conditions. How can
vision be so rich, but at the same time fast? The analysis-by-synthesis
approach to vision offers an account of the richness
of our percepts, but it is typically considered too slow
to explain perception in the brain. Here we propose a version
of analysis-by-synthesis in the spirit of the Helmholtz machine
(Dayan, Hinton, Neal, & Zemel, 1995) that can be implemented
efficiently, by combining a generative model based
on a realistic 3D computer graphics engine with a recognition
model based on a deep convolutional network. The recognition
model initializes inference in the generative model, which
is then refined by brief runs of MCMC. We test this approach
in the domain of face recognition and show that it meets several
challenging desiderata: it can reconstruct the approximate
shape and texture of a novel face from a single view, at a level
that humans judge indistinguishable from the original; it accounts quantitatively for human
behavior in “hard” recognition tasks that foil conventional
machine systems; and it qualitatively matches neural responses
in a network of face-selective brain areas. Comparison to other
models provides insight into the success of our approach.