Research Reports

With over 25 faculty and 60 students, the Department of Biostatistics today is a leader in statistical training for academia, government and industry. Faculty members collaborate with investigators in an extremely large number of diverse disciplines, and as a result biostatistics students have ample research opportunities. Our research programs in Bayesian methods, causal inference, genetics, hierarchical models, HIV/AIDS, longitudinal data analysis, phylogeny, spatial statistics and geographical information systems, survival analysis, and optimal design are well-respected nationally and internationally. We continue to grow in terms of our faculty, students and programs to meet current and future needs.

Sort By:

Show:

A Unifying Bayesian Approach for Sample Size Determination Using Design andAnalysis Priors

(2021)

On identifiability and consistency of the nugget in Gaussian spatial process models

(2021)

Spatial process models popular in geostatistics often represent the observed data as the sum of a smoothunderlying process and white noise. The variation in the white noise is attributed to measurement error,or micro-scale variability, and is called the “nugget”. We formally establish results on the identifiabilityand consistency of the nugget in spatial models based upon the Gaussian process within the framework ofin-fill asymptotics, i.e. the sample size increases within a sampling domain that is bounded. Our workextends results in fixed domain asymptotics for spatial models without the nugget. More specifically, weestablish the identifiability of parameters in the Matérn covariogram and the consistency of their maximumlikelihood estimators in the presence of discontinuities due to the nugget. We also present simulationstudies to demonstrate the role of the identifiable quantities in spatial interpolation.

Multivariate Directed Acyclic Graph Auto-Regressive (MDAGAR) models for spatial diseases mapping

(2021)

Disease mapping is an important statistical tool used by epidemiologists to assess geographic variation in disease rates and identify lurking environmental risk factors from spatial patterns. Such maps rely upon spatial models for regionally aggregated data, where neighboring regions tend to exhibit similar outcomes than those farther apart. We contribute to the literature on multivariate disease mapping, which deals with measurements on multiple (two or more) diseases in each region. We aim to disentangle associations among the multiple diseases from spatial autocorrelation in each disease. We develop Multivariate Directed Acyclic Graphical Autoregression (MDAGAR) models to accommodate spatial and inter-disease dependence. The hierarchical construction imparts flexibility and richness, interpretability of spatial autocorrelation and inter-disease relationships, and computational ease, but depends upon the order in which the cancers are modeled. To obviate this, we demonstrate how Bayesian model selection and averaging across orders are easily achieved using bridge sampling. We compare our method with a competitor using simulation studies and present an application to multiple cancer mapping using data from the Surveillance, Epidemiology, and End Results (SEER) Program.

High-dimensional MultivariateGeostatistics: A Conjugate BayesianMatrix-Normal Approach

(2021)

Joint modeling of spatially-oriented dependent variables are commonplace in the environmentalsciences, where scientists seek to estimate the relationships among a set of environmental outcomesaccounting for dependence among these outcomes and the spatial dependence for each outcome. Suchmodeling is now sought for very large data sets where the variables have been measured at a very largenumber of locations. Bayesian inference, while attractive for accommodating uncertainties through theirhierarchical structures, can become computationally onerous for modeling massive spatial data sets becauseof their reliance on iterative estimation algorithms. This manuscript develops a conjugate Bayesianframework for analyzing multivariate spatial data using analytically tractable posterior distributions thatdo not require iterative algorithms. We discuss differences between modeling the multivariate responseitself as a spatial process and that of modeling a latent process. We illustrate the computational andinferential benefits of these models using simulation studies and real data analyses for a Vege Indicesdataset with observed locations numbering in the millions.

Cover page of Pseudo-Likelihood Based Logistic Regression forEstimating COVID-19 Infection and Case FatalityRates by Gender, Race, and Age in California

Pseudo-Likelihood Based Logistic Regression forEstimating COVID-19 Infection and Case FatalityRates by Gender, Race, and Age in California

(2021)

Spatial Factor Modeling: A BayesianMatrix-Normal Approach for Misaligned Data

(2021)

Multivariate spatially-oriented data sets are prevalent in the environmental and physical sciences.Scientists seek to jointly model multiple variables, each indexed by a spatial location, to capture anyunderlying spatial association for each variable and associations among the different dependent variables.Multivariate latent spatial process models have proved effective in driving statistical inference andrendering better predictive inference at arbitrary locations for the spatial process. High-dimensionalmultivariate spatial data, which is the theme of this article, refers to data sets where the number of spatiallocations and the number of spatially dependent variables is very large. The field has witnessed substantialdevelopments in scalable models for univariate spatial processes, but such methods for multivariate spatialprocesses, especially when the number of outcomes is moderately large, are limited in comparison. Here,we extend scalable modeling strategies for a single process to multivariate processes. We pursue Bayesianinference which is attractive for full uncertainty quantification of the latent spatial process. Our approachexploits distribution theory for the Matrix-Normal distribution, which we use to construct scalableversions of a hierarchical linear model of coregionalization (LMC) and spatial factor models that deliverinference over a high-dimensional parameter space including the latent spatial process. We illustrate thecomputational and inferential benefits of our algorithms over competing methods using simulation studiesand an analysis of a massive vegetation index dataset.

Network modeling in biology: statistical methods for gene and brain networks.

(2021)

Bipartite tight spectral clustering (BiTSC) algorithm for identifying conserved gene co-clusters in two species.

(2021)

Highly Scalable Bayesian Geostatistical Modeling via Meshed Gaussian Processes on Partitioned Domains

(2020)

We introduce a class of scalable Bayesian hierarchical models for the analysis of massive geostatistical datasets. The underlying idea combines ideas on high-dimensional geostatistics by partitioning the spatial domain and modeling the regions in the partition using a sparsity-inducing directed acyclic graph (DAG). We extend the model over the DAG to a well-defined spatial process, which we call the meshed Gaussian process (MGP). A major contribution is the development of an MGPs on tessellated domains, accompanied by a Gibbs sampler for the efficient recovery of spatial random effects. The source code is available at github.com/mkln/meshgp.

Statistical hypothesis testing versus machine-learning binary classification: distinctions and guidelines.

(2020)