Novel Machine Learning and Statistical Models for High Dimensional and Observational Study Data: Applications to HIV genetic linkage network, fMRI and Survey Data
Abstract
The objective of this dissertation is to develop novel statistical models for different types of high-dimensional data, such as large-scale survey data, HIV genetic linkage network data and fMRI data. This dissertation comprises five parts. The Mann-Whitney-Wilcoxon rank sum test (MWWRST) is called for when two-sample t-tests fail to provide meaningful results because they are highly sensitive to outliers. In the first chapter, we develop an approach that extends the MWWRST to survey data to test the null hypothesis of equal mean ranks. Akin to the goal of modeling paired subjects' outcomes, or between-subject outcomes, in the MWWRST, in the second chapter we model the probability of HIV genetic linkage using semiparametric functional response models (FRM). We apply the proposed method to study genetic linkage between and within villages in Botswana using data from the Botswana Combination Prevention Project (BCPP), a cluster-randomized study implementing interventions to prevent and control HIV transmission in Botswana. Since BCPP is a survey study with nonresponse, we adopt a doubly robust estimator to address the missing data problem.
During the COVID-19 pandemic, daily high-resolution wastewater surveillance at the building level is being used at UCSD to identify potential undiagnosed infections and to trigger notification of residents and responsive testing, but the optimal determinants for notifications are unknown. To fill this gap, we propose a framework that uses classification/decision tree models to identify features of a series of wastewater test results that can predict the presence of COVID-19 in residences associated with the test sites. This collaborative work also motivates us to study the asymptotic properties of the random forest model, an ensemble of multiple classification trees, and to extend it to model between-subject outcomes in the next chapter.
Finally, my research on high-dimensional data also includes work on functional magnetic resonance imaging (fMRI). To detect peaks and identify the locations of peaks in fMRI data, we develop a Monte Carlo method to compute the height distribution of local maxima of a stationary Gaussian or Gaussian-derived random field that is observed on a regular lattice.
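To make the idea of a Monte Carlo peak-height distribution concrete, here is a minimal sketch under simplifying assumptions; it is not the method developed in the dissertation. It assumes a one-dimensional lattice, a unit-variance stationary Gaussian field obtained by smoothing white noise with a Gaussian-shaped kernel, and a local maximum defined as a site strictly higher than both neighbors. The function names and all parameter values are hypothetical.

```python
import math
import random


def simulate_peak_heights(n_fields=200, length=256, kernel_halfwidth=3, seed=0):
    """Collect heights of local maxima from simulated 1-D Gaussian fields.

    Illustrative assumption: each field is white noise smoothed by a
    Gaussian-shaped kernel, which gives a stationary Gaussian sequence.
    """
    rng = random.Random(seed)
    # Kernel weights, rescaled so the smoothed field has unit variance.
    weights = [math.exp(-0.5 * (k / 1.5) ** 2)
               for k in range(-kernel_halfwidth, kernel_halfwidth + 1)]
    norm = math.sqrt(sum(w * w for w in weights))
    weights = [w / norm for w in weights]

    heights = []
    for _ in range(n_fields):
        # Extra noise samples at the ends so every lattice site is fully smoothed.
        noise = [rng.gauss(0.0, 1.0) for _ in range(length + 2 * kernel_halfwidth)]
        field = [sum(w * noise[i + j] for j, w in enumerate(weights))
                 for i in range(length)]
        # A lattice local maximum is strictly higher than both neighbours.
        for i in range(1, length - 1):
            if field[i] > field[i - 1] and field[i] > field[i + 1]:
                heights.append(field[i])
    return heights


def empirical_tail(heights, u):
    """Monte Carlo estimate of P(peak height > u)."""
    return sum(h > u for h in heights) / len(heights)
```

Calling `empirical_tail(simulate_peak_heights(), u)` for an observed peak height `u` gives a simulation-based tail probability that could serve as a p-value for that peak under these assumptions. Real fMRI data would need a two- or three-dimensional lattice and a neighborhood definition and correlation structure matched to the images.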