We describe a computational model of humans' ability to
provide a detailed interpretation of a scene’s components.
Humans can identify meaningful components almost everywhere
in an image, and identifying these components is an
essential part of the visual process and of understanding the
surrounding scene and its potential meaning to the viewer.
Detailed interpretation is beyond the scope of current
models of visual recognition. Our model suggests that this is
a fundamental limitation, related to the fact that existing
models rely on feed-forward processing with only limited top-down
processing. In our model, a first recognition stage leads to
the initial activation of class candidates, which is
incomplete and of limited accuracy. This stage then
triggers the application of class-specific interpretation and
validation processes, which recover a richer and more
accurate interpretation of the visible scene. We discuss
implications of the model for visual interpretation by
humans and by computer vision models.
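
The two-stage flow described above can be illustrated with a minimal sketch. It is not the model's actual implementation: the names `Candidate`, `Interpretation`, `interpret_scene`, and `min_fraction` are hypothetical, the "image" is a toy set of component names, and validation is assumed, purely for illustration, to mean that a sufficient fraction of a class's expected components is found.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Set


@dataclass
class Candidate:
    label: str
    score: float  # initial bottom-up score; may be inaccurate


@dataclass
class Interpretation:
    label: str
    components: Dict[str, bool]  # component name -> found in the image?
    validated: bool


def interpret_scene(
    image: Set[str],
    recognize: Callable[[Set[str]], List[Candidate]],
    interpreters: Dict[str, Callable[[Set[str]], Dict[str, bool]]],
    min_fraction: float = 0.5,  # assumed validation threshold
) -> List[Interpretation]:
    """Stage 1 proposes class candidates; stage 2 applies a
    class-specific interpretation process to each and validates it."""
    results = []
    for cand in recognize(image):  # stage 1: feed-forward candidates
        interpreter = interpreters.get(cand.label)
        if interpreter is None:
            continue  # no class-specific process for this candidate
        components = interpreter(image)  # stage 2: top-down interpretation
        found = sum(components.values())
        validated = bool(components) and found / len(components) >= min_fraction
        results.append(Interpretation(cand.label, components, validated))
    return results


# Toy stand-ins: an "image" is just the set of components visible in it.
toy_image = {"ear", "eye", "mane"}


def toy_recognize(image: Set[str]) -> List[Candidate]:
    # Deliberately imprecise first pass: both classes are activated.
    return [Candidate("horse", 0.60), Candidate("dog", 0.55)]


toy_interpreters = {
    "horse": lambda img: {p: p in img for p in ("ear", "eye", "mane")},
    "dog": lambda img: {p: p in img for p in ("ear", "collar", "paw")},
}

results = interpret_scene(toy_image, toy_recognize, toy_interpreters)
for r in results:
    print(r.label, r.validated, r.components)
```

In this toy run the first stage activates both candidates with similar scores, and only the class-specific stage separates them: the horse interpretation finds all of its expected components, while the dog interpretation finds too few and is rejected.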