eScholarship
Open Access Publications from the University of California

UC Santa Cruz

UC Santa Cruz Electronic Theses and Dissertations

High-Dimensional Inference and Uncertainty Quantification for Variable Selection, Clustering and Object-oriented Analysis with Bayesian and Approximate Bayesian Methods

Abstract

Bayesian computation for high-dimensional problems using Markov Chain Monte Carlo (MCMC) or its variants can be extremely slow or completely prohibitive, since these methods perform costly computations at each iteration of the sampling chain. While some non-Bayesian alternatives have been somewhat successful in estimation, they struggle to provide uncertainty quantification. These problems are aggravated when the data size is large. To address them, the first chapter proposes a novel dynamic feature partitioned regression (DFP) for efficient online inference in high-dimensional linear regressions with large or streaming data. DFP constructs a pseudo posterior density of the parameters at every time point and quickly updates it when a new block of data (a data shard) arrives. It also partitions the set of parameters to exploit parallelization for efficient posterior computation. The proposed approach is applied to high-dimensional linear regression models with Gaussian scale mixture priors and spike-and-slab priors on large parameter spaces, along with large data, and yields state-of-the-art inferential performance. The algorithm also enjoys theoretical support: as shown in the appendix, the pseudo posterior densities get arbitrarily close to the full posterior as the data size grows.
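The idea of updating a posterior each time a data shard arrives can be illustrated with a much simpler stand-in. The sketch below is not DFP: it assumes a known noise variance and a plain Gaussian prior (both assumptions, chosen so the update is available in closed form), and it does not partition the parameters. The class name, the function name and the default values are hypothetical.

```python
import numpy as np


class StreamingGaussianPosterior:
    """Conjugate Gaussian posterior for regression coefficients, updated per shard.

    Assumptions (not from the abstract): known noise variance `sigma2` and a
    N(0, tau2 * I) prior. This is ordinary conjugate updating, not DFP.
    """

    def __init__(self, p, sigma2=1.0, tau2=10.0):
        self.sigma2 = sigma2
        self.precision = np.eye(p) / tau2  # prior precision matrix
        self.shift = np.zeros(p)           # precision @ posterior mean

    def update(self, X, y):
        # Each shard only adds its own sufficient statistics.
        self.precision += X.T @ X / self.sigma2
        self.shift += X.T @ y / self.sigma2

    def mean(self):
        return np.linalg.solve(self.precision, self.shift)

    def cov(self):
        return np.linalg.inv(self.precision)


def demo(seed=0, n=300, n_shards=6):
    """Fit once on shards and once on the full data, for comparison."""
    rng = np.random.default_rng(seed)
    beta = np.array([2.0, 0.0, -1.0, 0.0, 0.5])  # hypothetical true coefficients
    X = rng.normal(size=(n, beta.size))
    y = X @ beta + rng.normal(size=n)

    streaming = StreamingGaussianPosterior(beta.size)
    for X_shard, y_shard in zip(np.array_split(X, n_shards),
                                np.array_split(y, n_shards)):
        streaming.update(X_shard, y_shard)

    batch = StreamingGaussianPosterior(beta.size)
    batch.update(X, y)
    return streaming, batch
```

In this conjugate setting the shard-by-shard posterior coincides with the full-data posterior, so no approximation is involved. With Gaussian scale mixture or spike-and-slab priors no such closed form exists, which is the gap a pseudo posterior is meant to fill.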

While the first chapter advances methodology for ordinary high-dimensional regression, the second chapter focuses on regressions with multiple objects as predictors. Clinical researchers often collect multiple images from separate modalities (sources) to investigate fundamental questions of human health that are inadequately explained by considering one image source at a time. Viewing the collection of images as multiple objects, the successful integration of multi-object data produces a sum of information greater than the individual parts. This chapter is motivated by a multi-modal imaging application where structural/anatomical information from grey matter (GM) and brain connectivity information in the form of a brain connectome network from functional magnetic resonance imaging (fMRI) are available for multiple subjects. The primary goal of this chapter is to develop a regression model that predicts a scalar response from multiple objects and identifies the regions of interest (ROIs) significantly related to the response. The existing Bayesian regression literature on multi-object predictors either ignores the topology of some or all of these objects or does not adequately use the information shared by the multiple object predictors. In contrast, this chapter develops a flexible Bayesian regression framework that exploits the network information of the brain connectome while leveraging linkages between the connectome network and the anatomical information from GM to draw inference on significant ROIs and to offer predictive inference on the response. The principled Bayesian framework allows precise characterization of the uncertainty in ascertaining a region as influential for predicting the response, as well as quantification of predictive uncertainty for the response. The framework is implemented using an efficient Markov Chain Monte Carlo algorithm. Empirical results from simulation studies illustrate substantial inferential and predictive gains of the proposed framework over popular competitors.

While the first two chapters focus on high-dimensional and object-oriented regressions, the third chapter offers a novel clustering technique for high-dimensional tensors with limited sample size. Clustering of high-dimensional tensors with limited sample size has become prevalent in a variety of application areas. Existing Bayesian model-based clustering of tensors yields less accurate clusters when the tensor dimensions are sufficiently large, the sample size is small, and the clusters of tensors mainly differ in their variability. This chapter develops a novel clustering technique for high-dimensional tensors with limited sample sizes when the clusters differ in their covariances rather than their means. The proposed approach constructs several matrices from each tensor to adequately estimate its variability along different modes, and implements a model-based approximate Bayesian clustering algorithm on the matrices thus constructed from the original tensor data. Although some information in the data is discarded, we gain substantial computational efficiency and accuracy in clustering. The simulation study assesses the proposed approach and its competitors in terms of estimating the number of clusters, identifying the modal cluster membership, and the probability of misclassification in clustering (a measure of uncertainty in clustering). Clustering of tensors obtained from EEG data demonstrates an advantage of the proposed approach over its competitors.
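The abstract does not say how the matrices are constructed from each tensor. One common way to summarize variability along each mode, shown here purely as an assumption, is to unfold the tensor along that mode and form a scaled Gram matrix of the unfolding. The function names below are hypothetical, and the clustering step itself is not shown.

```python
import numpy as np


def unfold(tensor, mode):
    """Mode-`mode` unfolding: rows index the chosen mode, columns all other modes."""
    return np.moveaxis(tensor, mode, 0).reshape(tensor.shape[mode], -1)


def mode_covariances(tensor):
    """Return one d_k x d_k matrix per mode k of the tensor.

    Assumption (not from the abstract): each matrix is the Gram matrix of the
    mode-k unfolding, divided by the number of columns of that unfolding.
    """
    covs = []
    for k in range(tensor.ndim):
        X_k = unfold(tensor, k)
        covs.append(X_k @ X_k.T / X_k.shape[1])
    return covs


# Usage: a 3 x 4 x 5 tensor is reduced to matrices of size 3x3, 4x4 and 5x5.
rng = np.random.default_rng(0)
example = rng.normal(size=(3, 4, 5))
summaries = mode_covariances(example)
```

A summary of this kind replaces the full tensor with a few small matrices, one per mode. That matches the trade-off the abstract describes: some information is discarded in exchange for a much cheaper clustering problem.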
