Open Access Publications from the University of California

Research Reports

With over 25 faculty and 60 students, the Department of Biostatistics today is a leader in statistical training for academia, government and industry. Faculty members collaborate with investigators in an extremely large number of diverse disciplines, and as a result biostatistics students have ample research opportunities. Our research programs in Bayesian methods, causal inference, genetics, hierarchical models, HIV/AIDS, longitudinal data analysis, phylogeny, spatial statistics and geographical information systems, survival analysis, and optimal design are well-respected nationally and internationally. We continue to grow in terms of our faculty, students and programs to meet current and future needs.

Bayesian modeling and uncertainty quantification for descriptive social networks


This article presents a simple and easily implementable Bayesian approach to model and quantify uncertainty in small descriptive social networks. While statistical methods for analyzing networks have seen burgeoning activity over the last decade or so, ranging from social sciences to genetics, such methods usually involve sophisticated stochastic models whose estimation requires substantial structure and information in the networks. At the other end of the analytic spectrum, there are purely descriptive methods based upon quantities and axioms in computational graph theory. In social networks, popular descriptive measures include, but are not limited to, the so-called Krackhardt’s axioms. Another approach, recently gaining attention, is the use of PageRank algorithms. While these descriptive approaches provide insight into networks with limited information, including small networks, there is, as yet, little research detailing a statistical approach for small networks. This article aims to contribute at the interface of Bayesian statistical inference and social network analysis by offering practicing social scientists a relatively straightforward Bayesian approach to account for uncertainty while conducting descriptive social network analysis. The emphasis is on computational feasibility and easy implementation using existing R packages, such as sna and rjags, that are available from the Comprehensive R Archive Network. Using fully Bayesian model-based inference, we analyze a network comprising 18 websites from the US and UK to discern transnational identities; this network was previously analyzed using descriptive graph theory with no uncertainty quantification.

  • 1 supplemental PDF
Meta-Kriging: Scalable Bayesian Modeling and Inference for Massive Spatial Datasets


Spatial process models for analyzing geostatistical data entail computations that become prohibitive as the number of spatial locations becomes large. There is a burgeoning literature on approaches for analyzing large spatial datasets. In this article, we propose a divide-and-conquer strategy within the Bayesian paradigm. We partition the data into subsets, analyze each subset using a Bayesian spatial process model, and then obtain approximate posterior inference for the entire dataset by combining the individual posterior distributions from each subset. Importantly, as often desired in spatial analysis, we offer full posterior predictive inference at arbitrary locations for the outcome as well as the residual spatial surface after accounting for spatially oriented predictors. We call this approach “spatial meta-kriging” (SMK). We do not need to store the entire data in one processor, and this leads to superior scalability. We demonstrate SMK with various spatial regression models including Gaussian processes with Matérn and compactly supported correlation functions. The approach is intuitive, easy to implement, and is supported by theoretical results presented in the supplementary material available online. Empirical illustrations are provided using different simulation experiments and a geostatistical analysis of Pacific Ocean sea surface temperature data. Supplementary materials for this article are available online.

  • 1 supplemental PDF
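
The divide-and-conquer strategy in the abstract above (partition, fit each subset, combine the subset posteriors) can be illustrated with a toy, non-spatial sketch. Everything below is an assumption made for illustration and is not the SMK implementation: the model is a normal mean with known `sigma` and a flat prior, each subset likelihood is raised to the power `k` so that subset posteriors have roughly the full-data spread, and the combination rule is simple quantile averaging, standing in for whatever rule the article uses. The function names are hypothetical.

```python
import random
import statistics


def partition(data, k, seed=0):
    """Randomly split data into k roughly equal subsets."""
    rng = random.Random(seed)
    shuffled = data[:]
    rng.shuffle(shuffled)
    return [shuffled[i::k] for i in range(k)]


def subset_posterior_draws(subset, k, sigma, n_draws, rng):
    """Sorted draws from a subset posterior of the mean (flat prior, known sigma).

    Assumption: the subset likelihood is raised to the power k, so the
    posterior standard deviation is sigma / sqrt(len(subset) * k).
    """
    mean = statistics.fmean(subset)
    sd = sigma / (len(subset) * k) ** 0.5
    return sorted(rng.gauss(mean, sd) for _ in range(n_draws))


def combine_by_quantile_averaging(draws_per_subset):
    """Average the j-th smallest draw across subsets (a stand-in combination rule)."""
    return [statistics.fmean(column) for column in zip(*draws_per_subset)]


def meta_posterior(data, k=4, sigma=1.0, n_draws=2000, seed=1):
    """Partition the data, sample each subset posterior, and combine the draws."""
    rng = random.Random(seed)
    subsets = partition(data, k)
    draws = [subset_posterior_draws(s, k, sigma, n_draws, rng) for s in subsets]
    return combine_by_quantile_averaging(draws)


if __name__ == "__main__":
    rng = random.Random(42)
    data = [rng.gauss(3.0, 1.0) for _ in range(400)]
    combined = meta_posterior(data)
    print("combined posterior mean:", statistics.fmean(combined))
    print("combined posterior sd:  ", statistics.stdev(combined))
```

Each subset is processed independently, so the per-subset step could run on separate processors without any of them holding the full dataset, which is the scalability argument the abstract makes.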
Spatial Joint Species Distribution Modeling using Dirichlet Processes


Species distribution models usually attempt to explain presence-absence or abundance of a species at a site in terms of the environmental features (so-called abiotic features) present at the site. Historically, such models have considered species individually. However, it is well-established that species interact to influence presence-absence and abundance (envisioned as biotic factors). As a result, there has been substantial recent interest in joint species distribution models with various types of response, e.g., presence-absence, continuous and ordinal data. Such models incorporate dependence between species response as a surrogate for interaction. The challenge we address here is how to accommodate such modeling in the context of a large number of species (e.g., order 10^2) across sites numbering on the order of 10^2 or 10^3 when, in practice, only a few species are found at any observed site. Again, there is some recent literature to address this; we adopt a dimension reduction approach. The novel wrinkle we add here is spatial dependence. That is, we have a collection of sites over a relatively small spatial region so it is anticipated that species distribution at a given site would be similar to that at a nearby site. Specifically, we handle dimension reduction through Dirichlet processes, enabling clustering of species, joined with spatial dependence across sites through Gaussian processes. We use both simulated data and a plant communities dataset for the Cape Floristic Region (CFR) of South Africa to demonstrate our approach. The latter consists of presence-absence measurements for 639 tree species at 662 locations. Through both data examples we are able to demonstrate improved predictive performance using the foregoing specification.

Multivariate spatial meta kriging


This work extends earlier work on spatial meta kriging for the analysis of large multivariate spatial datasets as commonly encountered in environmental and climate sciences. Spatial meta-kriging partitions the data into subsets, analyzes each subset using a Bayesian spatial process model and then obtains approximate posterior inference for the entire dataset by optimally combining the individual posterior distributions from each subset. Importantly, as is often desired in spatial analysis, spatial meta kriging offers posterior predictive inference at arbitrary locations for the outcome as well as the residual spatial surface after accounting for spatially oriented predictors. Our current work explores the spatial meta kriging idea to enhance the scalability of multivariate spatial Gaussian process models that use the linear model of co-regionalization (LMC) to account for the correlation between multiple components. The approach is simple, intuitive and scales multivariate spatial process models to big data effortlessly. A simulation study reveals the inferential and predictive accuracy offered by spatial meta kriging on multivariate observations.
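
For readers unfamiliar with the LMC mentioned above, its usual textbook form can be written out as follows. The notation ($q$, $A$, $a_k$, $v_k$, $\rho_k$) is assumed here for illustration and is not taken from the article, which may use a different parameterization.

```latex
% q-variate spatial process built from q independent latent processes
w(s) = A\,v(s), \qquad v(s) = \bigl(v_1(s), \dots, v_q(s)\bigr)^{\top},
% each v_k is a zero-mean, unit-variance Gaussian process
% with correlation function \rho_k, independent across k.

% Resulting cross-covariance, with a_k the k-th column of A:
\operatorname{Cov}\bigl\{w(s),\, w(s')\bigr\}
  = \sum_{k=1}^{q} \rho_k(s, s')\, a_k a_k^{\top},
\qquad
\operatorname{Cov}\bigl\{w(s),\, w(s)\bigr\} = A A^{\top}.
```

Under this form the correlation between components at a single location is governed by $A A^{\top}$, while each $\rho_k$ controls how that dependence decays with distance.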

Practical Bayesian Modeling and Inference for Massive Spatial Datasets On Modest Computing Environments


With continued advances in Geographic Information Systems and related computational technologies, statisticians are often required to analyze very large spatial datasets. This has generated substantial interest over the last decade, already too vast to be summarized here, in scalable methodologies for analyzing large spatial datasets. Scalable spatial process models have been found especially attractive due to their richness and flexibility and, particularly so in the Bayesian paradigm, due to their presence in hierarchical model settings. However, the vast majority of research articles present in this domain have been geared toward innovative theory or more complex model development. Very limited attention has been accorded to approaches for easily implementable scalable hierarchical models for the practicing scientist or spatial analyst. This article devises massively scalable Bayesian approaches that can rapidly deliver inference on spatial processes that is practically indistinguishable from inference obtained using more expensive alternatives. A key emphasis is on implementation within very standard (modest) computing environments (e.g., a standard desktop or laptop) using easily available statistical software packages without requiring message-passing interfaces or parallel programming paradigms. Key insights are offered regarding assumptions and approximations concerning practical efficiency.

Toward a Diagnostic Toolkit for Linear Models with Gaussian-Process Distributed Random Effects


Gaussian processes (GPs) are widely used as distributions of random effects in linear mixed models, which are fit using the restricted likelihood or the closely related Bayesian analysis. This article addresses two problems. First, we propose tools for understanding how data determine estimates in these models, using a spectral basis approximation to the GP under which the restricted likelihood is formally identical to the likelihood for a gamma-errors GLM with identity link. Second, to examine the data’s support for a covariate and to understand how adding that covariate moves variation in the outcome y out of the GP and error parts of the fit, we apply a linear-model diagnostic, the added variable plot (AVP), both to the original observations and to projections of the data onto the spectral basis functions. The spectral- and observation-domain AVPs estimate the same coefficient for a covariate but emphasize low- and high-frequency data features respectively and thus highlight the covariate’s effect on the GP and error parts of the fit, respectively. The spectral approximation applies to data observed on a regular grid; for data observed at irregular locations, we propose smoothing the data to a grid before applying our methods. The methods are illustrated using the forest-biomass data of Finley et al. (2008).

  • 1 supplemental PDF
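
The added variable plot named in the abstract above is a standard linear-model diagnostic, and a minimal sketch of its ordinary, observation-domain form may help. This is not the spectral-domain version the article develops, and it assumes a plain linear model with an intercept, one existing covariate `z`, and one candidate covariate `x`. The function names are hypothetical. The plot shows the residuals of `y` given `z` against the residuals of `x` given `z`, and the slope through those points equals the coefficient of `x` in the full regression.

```python
import random


def simple_residuals(v, z):
    """Residuals from regressing v on an intercept and a single covariate z."""
    n = len(v)
    v_bar, z_bar = sum(v) / n, sum(z) / n
    szz = sum((zi - z_bar) ** 2 for zi in z)
    slope = sum((zi - z_bar) * (vi - v_bar) for zi, vi in zip(z, v)) / szz
    return [vi - v_bar - slope * (zi - z_bar) for vi, zi in zip(v, z)]


def added_variable_plot_data(y, x, z):
    """Return the AVP points (e_x, e_y) and the slope of the line through them."""
    e_y = simple_residuals(y, z)  # part of y not explained by z
    e_x = simple_residuals(x, z)  # part of x not explained by z
    slope = sum(a * b for a, b in zip(e_x, e_y)) / sum(a * a for a in e_x)
    return e_x, e_y, slope


def ols(columns, y):
    """Least-squares coefficients via normal equations and Gauss-Jordan elimination."""
    p = len(columns)
    # Augmented matrix [X'X | X'y]
    a = [[sum(ci * cj for ci, cj in zip(columns[i], columns[j])) for j in range(p)]
         + [sum(ci * yi for ci, yi in zip(columns[i], y))] for i in range(p)]
    for i in range(p):
        pivot = max(range(i, p), key=lambda r: abs(a[r][i]))
        a[i], a[pivot] = a[pivot], a[i]
        for r in range(p):
            if r != i:
                f = a[r][i] / a[i][i]
                a[r] = [ar - f * ai for ar, ai in zip(a[r], a[i])]
    return [a[i][p] / a[i][i] for i in range(p)]


if __name__ == "__main__":
    rng = random.Random(0)
    z = [rng.gauss(0, 1) for _ in range(200)]
    x = [0.5 * zi + rng.gauss(0, 1) for zi in z]
    y = [1.0 + 2.0 * zi - 1.5 * xi + rng.gauss(0, 0.3) for zi, xi in zip(z, x)]
    _, _, avp_slope = added_variable_plot_data(y, x, z)
    full_coef = ols([[1.0] * len(y), z, x], y)[2]
    print("AVP slope:", avp_slope, "full-model coefficient of x:", full_coef)
```

Plotting `e_y` against `e_x` gives the AVP itself; the two printed numbers should agree up to floating-point error.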
Coastline Kriging: A Bayesian Approach


Statistical interpolation of chemical concentrations at new locations is an important step in assessing a worker’s exposure level. When measurements are available from coastlines, as is the case in coastal clean-up operations in oil spills, one may need a mechanism to carry out spatial interpolation at new locations along the coast. In this article, we present a simple model for analyzing spatial data that is observed over a coastline. We demonstrate four different models using two different representations of the coast using curves. The four models were demonstrated on simulated data and one of them was also demonstrated on a dataset from the GuLF STUDY (Gulf Long-term Follow-up Study). Our contribution here is to offer practicing hygienists and exposure assessors a simple and easy method to implement Bayesian hierarchical models for analyzing and interpolating coastal chemical concentrations.

Multivariate left‐censored Bayesian modeling for predicting exposure using multiple chemical predictors


Environmental health exposures to airborne chemicals often originate from chemical mixtures. Environmental health professionals may be interested in assessing exposure to one or more of the chemicals in these mixtures, but often, exposure measurement data are not available, either because measurements were not collected/assessed for all exposure scenarios of interest or because some of the measurements were below the analytical methods' limits of detection (i.e., censored). In some cases, based on chemical laws, two or more components may have linear relationships with one another, whether in single or multiple mixtures. Although bivariate analyses can be used if the correlation is high, correlations are often low. To serve this need, this paper develops a multivariate framework for assessing exposure using relationships of the chemicals present in these mixtures. This framework accounts for censored measurements in all chemicals, allowing us to develop unbiased exposure estimates. We assessed our model's performance against simpler models at a variety of censoring levels and assessed our model's 95% coverage. We applied our model to assess vapor exposure from measurements of three chemicals in crude oil taken on the Ocean Intervention III during the Deepwater Horizon oil spill response and cleanup.

  • 1 supplemental PDF
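
To make the handling of measurements below the limit of detection (LOD) concrete, here is a minimal univariate sketch of a left-censored normal log-likelihood. It is only a building block and not the multivariate framework described above. It assumes measurements are analyzed on a scale where a normal model is reasonable (for example, log concentrations), and the function names are hypothetical. A detected value contributes the normal density; a censored value contributes the probability of falling below its LOD.

```python
import math


def normal_logpdf(y, mu, sigma):
    """Log density of N(mu, sigma^2) at y."""
    z = (y - mu) / sigma
    return -0.5 * z * z - math.log(sigma) - 0.5 * math.log(2.0 * math.pi)


def normal_logcdf(y, mu, sigma):
    """Log of P(Y <= y) for Y ~ N(mu, sigma^2).

    Uses Phi(z) = 0.5 * erfc(-z / sqrt(2)); this underflows for LODs
    very far below mu, which a production version would need to guard against.
    """
    z = (y - mu) / sigma
    return math.log(0.5 * math.erfc(-z / math.sqrt(2.0)))


def censored_loglik(values, censored, mu, sigma):
    """Left-censored normal log-likelihood.

    values[i] is the measurement when censored[i] is False,
    and the limit of detection when censored[i] is True.
    """
    total = 0.0
    for v, is_censored in zip(values, censored):
        if is_censored:
            total += normal_logcdf(v, mu, sigma)  # P(Y < LOD)
        else:
            total += normal_logpdf(v, mu, sigma)  # density at the observed value
    return total


if __name__ == "__main__":
    # Two detected values and one value reported only as below an LOD of 0.5
    values = [1.2, 0.9, 0.5]
    censored = [False, False, True]
    for mu in (0.0, 0.8, 2.0):
        print(mu, censored_loglik(values, censored, mu, sigma=0.5))
```

In a Bayesian analysis this log-likelihood would be combined with priors on `mu` and `sigma`; substituting a fixed value such as LOD/2 for censored measurements is the kind of simpler alternative this likelihood avoids.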