Gaussian Models: Regularization, Imputation, and Emulation
- Wang, Yuanbo
- Advisor(s): Oh, Sang-Yun
Abstract
Recent years have seen great advances in using Gaussian graphical models to characterize the conditional relationships among variables in many domains of study. In particular, many methods have been proposed for estimating the inverse covariance matrix. Along this line of research, glasso (the graphical lasso, proposed by Friedman et al. (2008)) provides an $\ell_1$-regularized maximum likelihood estimator. One challenge in such regularization-based methods is determining the scalar tuning parameter that balances model complexity against fit to the data, the latter frequently measured by the likelihood. In high dimensions, traditional model selection methods such as $k$-fold cross-validation, the Bayesian information criterion, and Akaike's information criterion can be challenging to apply for several reasons. First, estimating a high-dimensional inverse covariance matrix multiple times can be prohibitively expensive. Second, reasonable search grids for candidate penalty parameter values can vary considerably across applications, so substantial effort is required to find a suitable search range for each one. Third, applying the same regularization to all entries of the inverse covariance matrix can be limiting.
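For reference, the glasso estimator is commonly written as the penalized likelihood problem below. The notation is assumed here for illustration and does not come from the abstract: $S$ is the sample covariance matrix, $\Theta$ is the inverse covariance matrix, and $\lambda \ge 0$ is the scalar tuning parameter discussed above. Some formulations penalize only the off-diagonal entries of $\Theta$.

$$
\hat{\Theta} = \arg\min_{\Theta \succ 0} \left\{ \operatorname{tr}(S\Theta) - \log\det\Theta + \lambda \sum_{i,j} |\Theta_{ij}| \right\}
$$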
To address these challenges, we first propose block-wise robust selection (BRS), a tuning method based on distributionally robust optimization for selecting block-wise regularization parameters in the glasso estimator. This method finds adaptive penalty parameters for different blocks of the inverse covariance matrix, where the blocks are determined based on data dispersion. In this formulation, the search for a penalty parameter over an arbitrary range becomes a search for a significance level within the fixed interval $[0,1]$, regardless of the application of interest. Our method is computationally efficient and does not require data normalization prior to estimating the inverse covariance matrix.
Next, we demonstrate the application of BRS to the problem of climate field reconstruction, which aims to reconstruct past temperature evolution using the measurements from the post-instrumental period and the partial records from pre-instrumental times. The reconstruction can be viewed as a missing value imputation task. In these applications, we first use a Gaussian graphical model tuned via BRS to characterize the spatial field over the globe and then perform the imputation. In addition, we explore different clustering methods for grouping the variables in BRS. The reconstruction results confirm that our method is computationally attractive and yields imputed values similar to those obtained with a graph tuned by environmental scientists. Furthermore, BRS can be used flexibly with different variable grouping methods.
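As a minimal sketch of how imputation can work once a Gaussian model is fitted, the example below fills missing entries with their conditional mean given the observed entries, using the standard Gaussian identity expressed through the inverse covariance matrix. This is not the thesis's reconstruction procedure; the function name `impute_conditional_mean` and its arguments are hypothetical, and the mean vector and inverse covariance matrix are assumed to be already estimated.

```python
import numpy as np


def impute_conditional_mean(x, mu, precision):
    """Fill NaN entries of x with their conditional mean under N(mu, precision^{-1}).

    Hypothetical helper for illustration only. Uses the identity
    E[x_m | x_o] = mu_m - Theta_mm^{-1} Theta_mo (x_o - mu_o),
    where m indexes missing entries and o indexes observed entries.
    """
    x = np.asarray(x, dtype=float)
    mu = np.asarray(mu, dtype=float)
    precision = np.asarray(precision, dtype=float)

    missing = np.isnan(x)
    if not missing.any():
        return x.copy()
    observed = ~missing

    # Sub-blocks of the inverse covariance (precision) matrix.
    theta_mm = precision[np.ix_(missing, missing)]
    theta_mo = precision[np.ix_(missing, observed)]

    # Deviation of the observed entries from their means.
    resid = x[observed] - mu[observed]

    filled = x.copy()
    # Solve a linear system rather than forming an explicit inverse.
    filled[missing] = mu[missing] - np.linalg.solve(theta_mm, theta_mo @ resid)
    return filled
```

A sparse inverse covariance matrix, such as one estimated by glasso, keeps the `theta_mo` block sparse, so each missing entry depends directly only on its neighbors in the graph.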
Finally, we consider the emulation of physics-based simulators for environmental processes, leveraging Gaussian processes. Our goal is to develop a computationally efficient surrogate model that closely approximates the outputs of a physics-based environmental simulator that is expensive to run. Specifically, we develop an emulator for the Regional Hydro-Ecologic Simulation System (RHESSys) simulator. Our emulator uses Gaussian processes with seasonality embedded in the mean and a separable covariance. It approximates the output that would be obtained from the physics-based simulator at a substantially lower computational cost. In particular, it approximates the environmental time series that the physics-based model would generate, e.g., streamflow, under different hydrological and ecological scenarios, e.g., different soil properties. In addition, the accuracy and computational efficiency of the emulator enable us to conduct a global sensitivity analysis of the input-output relationship of the environmental process of interest and to identify the key influential environmental factors. Without the emulator, such an analysis would be intractable or very expensive, as it would require running the physics-based simulator many times over various input settings.