Vision can be posed as a problem of statistical learning and inference. As an oversimplified account, let W be a description of the outside scene in terms of "what is where," let I be the retinal image, and let p(W, I) be the joint distribution of W and I. Then visual learning is to learn p(W, I) from training data, and visual perception is to infer W from I based on p(W|I).
There are two major schools of thought on visual learning and perception. One school is operation-oriented and learns the inferential process defined by p(W|I) directly, often in the form of an explicit transformation W ≈ F(I). This scheme is mostly used in supervised learning, where W is the object category and is given in the training data. The other school is representation-oriented and learns the generative process p(W) and p(I|W) explicitly; perception then inverts the generative process by maximizing or sampling p(W|I) ∝ p(W)p(I|W). In this scheme, p(W) may also be accounted for by a regularization term such as smoothness or sparsity. This scheme is often used in unsupervised learning, where W is not available in the training data.
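The contrast between the two schools can be made concrete with a minimal sketch. The toy data, the discrete value sets, and all function names below are illustrative assumptions, not part of the original account: W is a binary label, I is an image quantized to three levels, and both are observed, so both schemes can be fit by simple counting. The representation-oriented route estimates p(W) and p(I|W) and infers W by maximizing p(W)p(I|W); the operation-oriented route tabulates a transformation F(I) directly.

```python
from collections import Counter

# Hypothetical toy data: (W, I) pairs, with W a binary scene label
# and I an image quantized to three levels. Purely illustrative.
pairs = [(0, 0), (0, 0), (0, 1), (0, 0), (1, 2),
         (1, 1), (1, 2), (0, 1), (1, 2), (0, 0)]
W_VALUES = (0, 1)
I_VALUES = (0, 1, 2)


def learn_generative(pairs):
    """Representation-oriented: estimate p(W) and p(I|W) by counting."""
    n = len(pairs)
    w_counts = Counter(w for w, _ in pairs)
    joint_counts = Counter(pairs)
    prior = {w: w_counts[w] / n for w in W_VALUES}
    likelihood = {
        w: {i: joint_counts[(w, i)] / w_counts[w] for i in I_VALUES}
        for w in W_VALUES
    }
    return prior, likelihood


def infer_generative(i, prior, likelihood):
    """Invert the generative process: argmax over W of p(W) p(I|W)."""
    scores = {w: prior[w] * likelihood[w][i] for w in W_VALUES}
    return max(scores, key=scores.get)


def learn_direct(pairs):
    """Operation-oriented: tabulate F(I) = argmax over W of p(W|I)."""
    joint_counts = Counter(pairs)
    return {
        i: max(W_VALUES, key=lambda w: joint_counts[(w, i)])
        for i in I_VALUES
    }


prior, likelihood = learn_generative(pairs)
F = learn_direct(pairs)

for i in I_VALUES:
    print(i, infer_generative(i, prior, likelihood), F[i])
```

With count-based estimates on fully labeled data the two routes return the same W for every I, since p(W)p(I|W) and p(W|I) differ only by the factor p(I). The practical difference lies in what is stored after learning (an explicit model of the image versus a lookup from image to label) and in the fact that only the generative route remains meaningful when W must be treated as hidden.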