Modeling Dependence in Large and Complex Data Sets

Abstract

Classical statistical theory mostly focuses on independent samples that reside in finite-dimensional vector spaces. While such methods are often appropriate and yield fruitful results, practical data analyses often go beyond the scope of these classical settings. In particular, with technological advancements, the computing power to record large volumes of data points at high frequency is more accessible than ever before. Large data volumes make it possible to produce metadata on sample points, such as distributions, networks, or shapes, and high-frequency records enable one to model dependency structures at a fine temporal and/or spatial resolution that would not be possible with sparsely recorded data. In the age of big data, the study of the data atoms that constitute complex data objects and the statistical modeling of high-resolution signals endowed with rich dependency structures are hitting their stride.

In this dissertation, we consider two specific instances of such big data. The first is time-dependent distributional data represented by the corresponding probability density functions. Data consisting of time-indexed distributions of cross-sectional or intraday returns have been extensively studied in finance and provide one example in which the data atoms are serially dependent probability distributions. Motivated by such data, we propose an autoregressive model for density time series that exploits the tangent space structure on the space of distributions induced by the Wasserstein metric. The densities themselves are not assumed to have any specific parametric form, leading to flexible forecasting of future unobserved densities. The main estimation targets in the order-$p$ Wasserstein autoregressive model are the Wasserstein autocorrelations and the vector-valued autoregressive parameter. We propose suitable estimators and establish their asymptotic normality, which is verified in a simulation study. The new order-$p$ Wasserstein autoregressive model leads to a prediction algorithm that includes a data-driven order selection procedure. Its performance is compared to existing prediction procedures via application to four financial return data sets, where a variety of metrics are used to quantify forecasting accuracy. For most metrics, the proposed model outperforms existing methods in two of the data sets, while the best empirical performance in the other two data sets is attained by existing methods based on functional transformations of the densities.

The second instance is brain functional magnetic resonance imaging (fMRI) signals contaminated by spatiotemporal noise at the voxel level. Such data feature a rich spatiotemporal dependency structure owing to the fine acquisition resolution. In neuroscience studies, resting-state brain functional connectivity quantifies the similarity between pairs of brain regions, each of which consists of voxels at which dynamic signals are acquired via neuroimaging techniques, for example the blood-oxygen-level-dependent (BOLD) signals that quantify an fMRI scan. Pearson correlation and similar metrics have been adopted to estimate inter-regional connectivity, often through averaging of signals within regions. However, dependencies between signals within each region and the presence of noise contaminate such inter-regional correlation estimates. We propose a mixed-effects model with a simple spatiotemporal covariance structure that explicitly isolates the different sources of variability in the observed BOLD signals, including correlated regional signals, local spatiotemporal noise, and measurement error. Methods for tackling the computational challenges associated with restricted maximum likelihood estimation are discussed. Large sample properties are established under mild and practically verifiable sufficient conditions. Simulation results demonstrate that the parameters of the proposed model can be accurately estimated and that the model is superior to the Pearson correlation of averages in the presence of spatiotemporal noise. The model was also applied to data collected from a dead rat and an anesthetized live rat, and brain networks were constructed from the estimated model parameters. Large-scale parallel computing and GPU acceleration were implemented to speed up connectivity estimation.
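To fix ideas for the first part, the following is a minimal sketch of an order-$p$ autoregression in a Wasserstein tangent space; the symbols $\mu_\oplus$, $T_t$, and $V_t$ are introduced here purely for illustration and need not match the dissertation's notation or exact model. Let $\mu_t$ denote the distribution observed at time $t$ and $\mu_\oplus$ a fixed reference distribution (for example, a Wasserstein barycenter of the sample). If $T_t$ is the optimal transport map pushing $\mu_\oplus$ forward to $\mu_t$, the tangent-space representation of $\mu_t$ is $V_t = T_t - \mathrm{id}$, and one may posit
$$V_t = \beta_1 V_{t-1} + \cdots + \beta_p V_{t-p} + \varepsilon_t,$$
with $(\beta_1, \ldots, \beta_p)$ playing the role of a vector-valued autoregressive parameter. A forecast $\widehat{V}_{t+1}$ is then mapped back to a distribution via the pushforward $(\widehat{V}_{t+1} + \mathrm{id})_{\#}\,\mu_\oplus$, so no parametric form for the densities is required.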
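For the second part, a minimal sketch of a signal-plus-noise decomposition of the kind described above is as follows; the symbols and the separable covariance choice are illustrative assumptions, not the dissertation's exact specification. Writing $Y_{rvt}$ for the observed BOLD signal at voxel $v$ of region $r$ and time point $t$, one may posit
$$Y_{rvt} = \mu_{rt} + \eta_{rvt} + \epsilon_{rvt},$$
where $\mu_{rt}$ is the shared regional signal whose inter-regional correlation $\rho_{rr'} = \mathrm{Corr}(\mu_{rt}, \mu_{r't})$ is the connectivity parameter of interest, $\eta_{rvt}$ is local spatiotemporal noise (for example, with a covariance that is separable in space and time), and $\epsilon_{rvt}$ is white measurement error. Estimating $\rho_{rr'}$ by restricted maximum likelihood in such a model separates the regional signal from voxel-level noise, whereas the Pearson correlation of within-region averages absorbs both.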
