The objective of perceptual organization (grouping, segmentation, and recognition) is to parse generic natural images into their constituent components, which are instances of a wide variety of visual patterns. These visual patterns are fundamentally stochastic processes governed by probabilistic models, which ought to be learned from the statistics of natural images. In this paper, we review research streams from several disciplines and divide existing models into four categories according to their semantic structures: descriptive models, causal Markov models, generative models, and discriminative models. The objectives, principles, theories, and typical models are reviewed in each category. The central theme of this epistemological paper is to study the relationships between the four types of models and to pursue a unified mathematical framework for the conceptualization (or definition) and modeling of various visual patterns. In representation, we point out that the effective integration of descriptive and generative models is the future direction for statistical modeling. To make visual models computationally tractable, we study the causal Markov models as approximations, and we observe that the discriminative models are computational heuristics for inferring generative models. Under this unified mathematical framework, statistical models for various patterns should form a "continuous" spectrum - in the sense that they belong to a series of probability families in the space of attributed graphs. Visual patterns and their parts are conceptualized as statistical ensembles governed by their models. These statistical models and concepts amount to a visual language with a hierarchy of vocabularies, which is essential for building effective, robust, and generic vision systems.

## Scholarly Works (20 results)

Natural scenes consist of a wide variety of stochastic patterns. While many patterns are represented well by statistical models in two-dimensional regions, as most image segmentation work assumes, some other patterns are fundamentally one-dimensional and thus cause major problems in segmentation. We call the former region processes and the latter curve processes. In this paper, we propose a stochastic algorithm for parsing an image into a number of region and curve processes. The paper makes the following contributions to the literature. Firstly, it presents a generative rope model for the curve processes in the form of a Hidden Markov Model (HMM). The hidden layer is a Markov chain with each element being an image base selected from an over-complete basis, such as Difference of Gaussians (DOG) or Difference of Offset Gaussians (DOOG) at various scales and orientations. The rope model accounts for the geometric smoothness and photometric coherence of the curve processes. Secondly, it integrates 2D region models, such as textures and splines, with 1D curve models under the Bayes framework. Because both region and curve models are generative, they compete to explain input images in a layered representation. Thirdly, it achieves global optimization by effective Markov chain Monte Carlo methods in the sense of maximizing a posterior probability. The Markov chain consists of reversible jumps and diffusions driven by bottom-up information. The algorithm is applied to real images with satisfactory results. We verify the results through random synthesis and compare them against segmentations with region processes only.
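As a rough illustration of what "optimization by Markov chain Monte Carlo in the sense of maximizing a posterior probability" means, the following is a minimal sketch, not the paper's algorithm. It uses a plain Metropolis sampler with a symmetric proposal on a small discrete state space instead of reversible jumps and diffusions, and it simply remembers the highest-posterior state visited. The function name `metropolis_map`, the toy `weights`, and the step count are all assumptions made for the example.

```python
import math
import random


def metropolis_map(log_post, n_states, n_steps, seed=0):
    """Search for the posterior mode with a plain Metropolis chain.

    log_post(w) is an unnormalized log posterior over states 0..n_states-1.
    This toy sampler is NOT the reversible-jump scheme from the abstract.
    """
    rng = random.Random(seed)
    state = 0
    best = state
    for _ in range(n_steps):
        # Symmetric proposal: move one step left or right on a ring.
        step = 1 if rng.random() < 0.5 else -1
        proposal = (state + step) % n_states
        # Metropolis acceptance: always accept uphill moves, otherwise
        # accept with probability equal to the posterior ratio.
        log_ratio = log_post(proposal) - log_post(state)
        if log_ratio >= 0 or rng.random() < math.exp(log_ratio):
            state = proposal
        # Keep the best state seen so far as the MAP estimate.
        if log_post(state) > log_post(best):
            best = state
    return best


# Hypothetical unnormalized posterior with a single mode at index 6.
weights = [1, 2, 4, 8, 16, 32, 64, 32, 16, 8]
best_state = metropolis_map(lambda w: math.log(weights[w]), len(weights), 2000)
```

Tracking the best visited state is one simple way to turn a sampler into an optimizer; annealing the posterior is another common choice that this sketch leaves out.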

We have learned a lot from studying the sequence of artful works of the two authors on EM/data augmentation. In this note, we will discuss some of our thoughts (or rather speculations) on the problem of vision from the perspective of missing data modeling and data augmentation.

- 1 supplemental PDF

This paper presents a class of statistical models that integrate two statistical modeling paradigms in the literature: I) descriptive methods, such as Markov random fields and minimax entropy learning [41], and II) generative methods, such as principal component analysis, independent component analysis [2], transformed component analysis [11], wavelet coding [27, 5], and sparse coding [30, 24]. In this paper, we demonstrate the integrated framework by constructing a class of hierarchical models for texton patterns (the term "texton" was coined by the psychologist Julesz in the early 80's). At the bottom level of the model, we assume that an observed texture image is generated by multiple hidden "texton maps", and textons on each map are translated, scaled, stretched, and oriented versions of a window function, like mini-templates or wavelet bases. The texton maps generate the observed image by occlusion or linear superposition. This bottom level of the model is generative in nature. At the top level of the model, the spatial arrangements of the textons in the texton maps are characterized by the minimax entropy principle, which leads to embellished versions of the Gibbs point process [34]. The top level of the model is descriptive in nature. We demonstrate the integrated model by a set of experiments.

Vision can be posed as a statistical learning and inference problem. As an over-simplified account, let W be a description of the outside scene in terms of "what is where," let I be the retina image, and let p(W, I) be the joint distribution of W and I. Then visual learning is to learn p(W, I) from training data, and visual perception is to infer W from I based on p(W|I).

There are two major schools on visual learning and perception. One school is operation-oriented and learns the inferential process defined by p(W|I) directly, often in the form of an explicit transformation W ≈ F(I). This scheme is mostly used in supervised learning, where W is the object category and is given in the training data. The other school is representation-oriented and learns the generative process p(W) and p(I|W) explicitly; perception then inverts the generative process by maximizing or sampling p(W|I) ∝ p(W)p(I|W). In this scheme, p(W) may also be accounted for by a regularization term such as smoothness or sparsity. This scheme is often used in unsupervised learning, where W is not available in the training data.
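The representation-oriented scheme can be made concrete with a toy discrete case. The sketch below assumes that W and I each take only a couple of symbolic values, and it inverts a hand-written generative model with Bayes' rule, so that the posterior is proportional to p(W)p(I|W). The labels `"edge"`, `"flat"`, `"high_contrast"`, and `"low_contrast"`, the probabilities, and the function name `posterior` are all made up for illustration and do not come from the paper.

```python
def posterior(prior, likelihood, image):
    """Return p(W | I = image) for a discrete W using Bayes' rule."""
    # Unnormalized posterior: p(W) * p(I | W) for each candidate W.
    unnorm = {w: prior[w] * likelihood[w].get(image, 0.0) for w in prior}
    z = sum(unnorm.values())
    if z == 0.0:
        raise ValueError("observation has zero probability under the model")
    return {w: v / z for w, v in unnorm.items()}


# Hypothetical generative model: prior p(W) and likelihood p(I | W).
prior = {"edge": 0.3, "flat": 0.7}
likelihood = {
    "edge": {"high_contrast": 0.9, "low_contrast": 0.1},
    "flat": {"high_contrast": 0.2, "low_contrast": 0.8},
}

post = posterior(prior, likelihood, "high_contrast")
# Perception as maximization: pick the W with the highest posterior.
w_map = max(post, key=post.get)
```

Here the less probable scene description under the prior wins once the observation is taken into account, which is the sense in which perception inverts the generative process. An operation-oriented method would instead fit a mapping from I to W directly from labeled pairs.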

Textons refer to fundamental micro-structures in generic natural images and thus constitute the basic elements in early (pre-attentive) visual perception. However, the word "texton" remains a vague concept in the literature of computer vision and visual perception, and a precise mathematical definition has yet to be found. In this article, we argue that the definition of texton should be governed by a sound mathematical model of images, and the set of textons must be learned from, or best tuned to, an image ensemble. We adopt a generative image model in which an image is a superposition of bases from an over-complete dictionary; a texton is then defined as a mini-template that consists of a varying number of image bases with some geometric and photometric configurations. By analogy to physics, if image bases are like protons, neutrons, and electrons, then textons are like atoms. A small number of textons can then be learned from training images as repeating micro-structures. We report four experiments for comparison. The first experiment computes clusters in the feature space of filter responses. The second experiment uses transformed component analysis in both feature space and image patches. The third adopts a two-layer generative model where an image is generated by image bases and image bases are generated by textons. The fourth experiment shows textons from motion image sequences, which we call movetons.
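The two-layer generative idea (textons generate image bases, image bases generate the image by superposition) can be sketched on a one-dimensional signal. The code below is only an illustrative toy under stated assumptions: the base shapes `"bar"` and `"edge"`, the texton `"T"`, its offsets and coefficients, and the function name `render` are all hypothetical, and no learning is performed.

```python
def render(length, bases, textons, placements):
    """Synthesize a 1D signal as a linear superposition of image bases.

    bases:      name -> list of sample values (a tiny "dictionary").
    textons:    name -> list of (base_name, offset, coefficient) tuples,
                i.e. a mini-template made of several bases.
    placements: list of (texton_name, position) pairs.
    """
    signal = [0.0] * length
    for texton_name, pos in placements:
        # Layer 1: each texton expands into its constituent bases.
        for base_name, offset, coeff in textons[texton_name]:
            # Layer 2: each base adds its scaled samples to the signal.
            for k, value in enumerate(bases[base_name]):
                idx = pos + offset + k
                if 0 <= idx < length:
                    signal[idx] += coeff * value
    return signal


# Hypothetical dictionary and a single texton built from two bases.
bases = {"bar": [1.0, 1.0], "edge": [-1.0, 1.0]}
textons = {"T": [("bar", 0, 2.0), ("edge", 2, 1.0)]}

# The same texton repeated at two positions gives a repeating micro-structure.
signal = render(8, bases, textons, [("T", 0), ("T", 4)])
```

In this toy the geometric configuration is the offset of each base inside the texton and the photometric configuration is its coefficient. Learning textons would amount to recovering such templates from data, which the sketch does not attempt.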