We show that linear generalizations of Rescorla-Wagner can perform Maximum Likelihood estimation of the parameters of all generative models for causal reasoning. Our approach involves augmenting variables to deal with conjunctions of causes, similar to the augmented model of Rescorla. Our results involve genericity assumptions on the distributions of causes. If these assumptions are violated, for example for the Cheng causal power theory, then we show that a linear Rescorla-Wagner can estimate the parameters of the model up to a nonlinear transformation. Moreover, a nonlinear Rescorla-Wagner is able to estimate the parameters directly to within arbitrary accuracy. Previous results can be used to determine convergence and to estimate convergence rates.

## Scholarly Works (40 results)

This paper analyses the Contrastive Divergence algorithm for learning statistical parameters. We relate the algorithm to the stochastic approximation literature. This enables us to specify conditions under which the algorithm is guaranteed to converge to the optimal solution (with probability 1). This includes necessary and sufficient conditions for the solution to be unbiased.

This paper analyzes generalizations of the classic Rescorla-Wagner (R-W) learning algorithm and studies their relationship to Maximum Likelihood estimation of causal parameters. We prove that the parameters of two popular causal models, ΔP and PC, can be learnt by the same generalized linear Rescorla-Wagner (GLRW) algorithm provided genericity conditions apply. We characterize the fixed points of these GLRW algorithms and calculate the fluctuations about them, assuming that the input is a set of i.i.d. samples from a fixed (unknown) distribution. We describe how to determine convergence conditions and calculate convergence rates for the GLRW algorithms under these conditions.
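As a rough illustration of the kind of update rule discussed above, the following is a minimal sketch of the classic linear Rescorla-Wagner rule, not the GLRW algorithm of the paper. It assumes one candidate cause plus an always-present background cue; in that setting the fixed point of the expected update places P(effect | no cause) on the background weight and ΔP on the candidate cause. The function names, learning rate, and probabilities are all hypothetical choices made for the example.

```python
import random

def rescorla_wagner(trials, n_cues, rate=0.01):
    """Classic linear Rescorla-Wagner: V += rate * (outcome - prediction) * cue."""
    weights = [0.0] * n_cues
    history = []
    for cues, outcome in trials:
        prediction = sum(w * x for w, x in zip(weights, cues))
        error = outcome - prediction
        weights = [w + rate * error * x for w, x in zip(weights, cues)]
        history.append(list(weights))
    return weights, history

def sample_trials(n, p_cause, p_effect_cause, p_effect_no_cause, seed=0):
    """Draw i.i.d. trials; the first cue is an always-present background (assumption)."""
    rng = random.Random(seed)
    trials = []
    for _ in range(n):
        cause = 1 if rng.random() < p_cause else 0
        p_effect = p_effect_cause if cause else p_effect_no_cause
        effect = 1 if rng.random() < p_effect else 0
        trials.append(((1, cause), effect))
    return trials

# Hypothetical probabilities: P(e|c) = 0.8, P(e|not c) = 0.2, so Delta-P = 0.6.
trials = sample_trials(20000, 0.5, 0.8, 0.2)
_, history = rescorla_wagner(trials, n_cues=2)

# Average the late weights to smooth out fluctuations about the fixed point.
tail = history[-5000:]
avg_background = sum(h[0] for h in tail) / len(tail)
avg_cause = sum(h[1] for h in tail) / len(tail)
print(avg_background, avg_cause)
```

With a constant learning rate the weights keep fluctuating about the fixed point, which is why the sketch averages the last trials rather than reading off the final weights.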

We describe a hierarchical compositional system for detecting deformable objects in images. Objects are represented by graphical models. The algorithm uses a hierarchical tree where the root of the tree corresponds to the full object and lower-level elements of the tree correspond to simpler features. The algorithm proceeds by passing simple messages up and down the tree. The method works rapidly, in under a second, on 320 × 240 images. We demonstrate the approach on detecting cats, horses, and hands. The method works in the presence of background clutter and occlusions. Our approach is contrasted with more traditional methods such as dynamic programming and belief propagation.

Attention mechanisms have recently been introduced in deep learning for various tasks in natural language processing and computer vision. But despite their popularity, the "correctness" of the implicitly-learned attention maps has only been assessed qualitatively by visualization of several examples. In this paper we focus on evaluating and improving the correctness of attention in neural image captioning models. Specifically, we propose a quantitative evaluation metric for how well the attention maps align with human judgment, using recently released datasets with alignment between regions in images and entities in captions. We then propose novel models with different levels of explicit supervision for learning attention maps during training. The supervision can be strong when alignment between regions and caption entities is available, or weak when only object segments and categories are provided. We show on the popular Flickr30k and COCO datasets that introducing supervision of attention maps during training solidly improves both attention correctness and caption quality.

This paper gives an algorithm for detecting and reading text in natural images. The algorithm is intended for use by blind and visually impaired subjects walking through city scenes. We first obtain a dataset of city images taken by blind and normally sighted subjects. From this dataset, we manually label and extract the text regions. Next we perform statistical analysis of the text regions to determine which image features are reliable indicators of text and have low entropy (i.e. feature response is similar for all text images). We obtain weak classifiers by using joint probabilities for feature responses on and off text. These weak classifiers are used as input to an AdaBoost machine learning algorithm to train a strong classifier. In practice, we trained a cascade with 4 strong classifiers containing 79 features. An adaptive binarization and extension algorithm is applied to those regions selected by the cascade classifier. Commercial OCR software is then used to read the text or reject it as a non-text region. The overall algorithm has a success rate of over 90% (evaluated by complete detection and reading of the text) on the test set, and the unread text is typically small and distant from the viewer.
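The boosting step mentioned above can be sketched as follows. This is a minimal, generic discrete AdaBoost, not the paper's implementation: the weak classifiers here are simple threshold stumps on a single 1-D feature (a swapped-in stand-in for the paper's classifiers built from joint feature probabilities), and the data, function names, and number of rounds are hypothetical.

```python
import math

def train_adaboost(xs, ys, n_rounds=3):
    """Discrete AdaBoost with threshold stumps; labels must be +1 or -1."""
    n = len(xs)
    weights = [1.0 / n] * n
    ensemble = []  # list of (alpha, threshold, polarity)
    thresholds = sorted(set(xs))
    for _ in range(n_rounds):
        best = None
        # Pick the stump with the lowest weighted error under current weights.
        for t in thresholds:
            for polarity in (1, -1):
                preds = [polarity if x >= t else -polarity for x in xs]
                err = sum(w for w, p, y in zip(weights, preds, ys) if p != y)
                if best is None or err < best[0]:
                    best = (err, t, polarity, preds)
        err, t, polarity, preds = best
        err = min(max(err, 1e-10), 1 - 1e-10)  # avoid log(0) and division by zero
        alpha = 0.5 * math.log((1 - err) / err)
        ensemble.append((alpha, t, polarity))
        # Up-weight misclassified samples, down-weight correct ones, renormalize.
        weights = [w * math.exp(-alpha * y * p)
                   for w, y, p in zip(weights, ys, preds)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble

def predict(ensemble, x):
    """Strong classifier: sign of the alpha-weighted vote of the stumps."""
    score = sum(a * (pol if x >= t else -pol) for a, t, pol in ensemble)
    return 1 if score >= 0 else -1

# Toy feature responses (hypothetical): no single stump separates these labels.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [1, 1, -1, -1, -1, 1, 1, 1]
ensemble = train_adaboost(xs, ys, n_rounds=3)
predictions = [predict(ensemble, x) for x in xs]
```

On this toy set the best single stump misclassifies two of the eight samples, while the weighted vote of three stumps classifies all of them correctly, which is the sense in which boosting turns weak classifiers into a strong one.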

Our paper has two main contributions. Firstly, it presents a model for image sequences motivated by an image encoding perspective. It models accreted regions, where objects appear, as well as motion and motion boundaries. We formulate the problem as probabilistic inference using prior models of images and the motion field. Secondly, it introduces a new algorithm for motion estimation based on Swendsen-Wang Cuts, which performs inference on the image sequence model using bottom-up proposals to guide the search. The algorithm proceeds by first estimating the motion without the boundaries, and then by clustering in the velocity space to obtain initial estimates of the motion boundaries. The algorithm performs MAP estimation by evolving the motion boundaries by a stochastic boundary diffusion algorithm, while improving the motion estimates. Our approach is illustrated on real images of city scenes and on simulated data and can deal with large motions (even 10 pixels or more per frame). We show a brief comparison of Swendsen-Wang Cuts with Graph Cuts and Belief Propagation on the related stereo matching problem.

It has long been a dream to make computers intelligent. Humans are capable of understanding information of multiple modalities such as video, text, and audio, so teaching computers to jointly understand multi-modal information is a necessary and essential step towards artificial intelligence. How to jointly represent multi-modal information is critical to this step. Although much effort has been devoted to exploring the representation of each modality individually, learning a joint multi-modal representation remains an open and challenging problem.

In this dissertation, we explore joint image-text representation models based on Visual-Semantic Embedding (VSE). VSE has been recently proposed and shown to be effective for joint representation. The key idea is that by learning a mapping from images into a semantic space, the algorithm is able to learn a compact and effective joint representation. However, existing approaches simply map each text concept and each whole image to single points in the semantic space. We propose several novel visual-semantic embedding models that use (1) text concept modeling, (2) image-level modeling, and (3) object-level modeling. In particular, we first introduce a novel Gaussian Visual-Semantic Embedding (GVSE) model that leverages the visual information to model text concepts as density distributions rather than single points in semantic space. Then, we propose Multiple Instance Visual-Semantic Embedding (MIVSE) via image-level modeling, which discovers and maps the semantically meaningful image sub-regions to their corresponding text labels. Next, we present a fine-grained object-level representation in images, Scene-Domain Active Part Models (SDAPM), that reconstructs and characterizes 3D geometric statistics between an object's parts in the 3D scene domain. Finally, we explore advanced joint representations for other visual and textual modalities, including joint image-sentence representation and joint video-sentence representation.

Extensive experiments have demonstrated that the proposed joint representation models are superior to existing methods on various tasks involving image, video and text modalities, including image annotation, zero-shot learning, object and parts detection, pose and viewpoint estimation, image classification, text-based image retrieval, image captioning, video annotation, and text-based video retrieval.

This thesis presents methods and results to solve the problem of joint object recognition and reconstruction. The proposed solution is a dictionary of deformable image patches and a hierarchical model encoding spatial compositions. Both the dictionary and the composition model are learned from data without supervision. The patch dictionary is shown to achieve state-of-the-art performance on digit recognition while remaining capable of high-quality reconstruction. The hierarchical model is shown to account for human chunk learning behavior not captured by previous theories. Both learning algorithms are significantly faster and easier to use than previous methods of similar purpose.