Open Access Publications from the University of California



Open Access Policy Deposits

This series is automatically populated with publications deposited by UCLA Fielding School of Public Health Department of Biostatistics researchers in accordance with the University of California’s open access policies. For more information see Open Access Policy Deposits and the UC Publication Management System.

A model-based approach to designing developmental toxicology experiments using sea urchin embryos.


The key aim of this paper is to suggest a more quantitative approach to designing a dose-response experiment and, more specifically, a concentration-response experiment. The work proposes a departure from the traditional experimental design used to determine a dose-response relationship in a developmental toxicology study. It is proposed that a model-based approach can provide the most accurate statistical inference, at maximal efficiency, for the quantities of interest, which may be one or more model parameters or pre-specified functions of the model parameters, such as a lethal dose. When the design criterion or criteria can be determined at the outset, there are demonstrated efficiency gains from using a carefully selected model-based optimal design rather than an ad hoc empirical design. As an illustration, a model-based approach was used to construct, in theory, efficient designs for inference in a developmental toxicity study of sea urchin embryos exposed to trimethoprim. This study compares and contrasts the results obtained using model-based optimal designs versus an ad hoc empirical design.

A mammalian methylation array for profiling methylation levels at conserved sequences.


Infinium methylation arrays are not available for the vast majority of non-human mammals. Moreover, even if species-specific arrays were available, probe differences between them would confound cross-species comparisons. To address these challenges, we developed the mammalian methylation array, a single custom array that measures up to 36k CpGs per species that are well conserved across many mammalian species. We designed a set of probes that can tolerate specific cross-species mutations. We annotate the array in over 200 species and report CpG island status and chromatin states in select species. Calibration experiments demonstrate the high fidelity in humans, rats, and mice. The mammalian methylation array has several strengths: it applies to all mammalian species even those that have not yet been sequenced, it provides deep coverage of conserved cytosines facilitating the development of epigenetic biomarkers, and it increases the probability that biological insights gained in one species will translate to others.

Methylation studies in Peromyscus: aging, altitude adaptation, and monogamy.


DNA methylation-based biomarkers of aging have been developed for humans and many other mammals and could be used to assess how stress factors impact aging. Deer mice (Peromyscus) are long-living rodents that have emerged as an informative model to study aging, adaptation to extreme environments, and monogamous behavior. In the present study, we have undertaken an exhaustive, genome-wide analysis of DNA methylation in Peromyscus, spanning different species, stocks, sexes, tissues, and age cohorts. We describe DNA methylation-based estimators of age for different species of deer mice based on novel DNA methylation data generated on highly conserved mammalian CpGs measured with a custom array. The multi-tissue epigenetic clock for deer mice was trained on 3 tissues (tail, liver, and brain). Two human-Peromyscus clocks accurately measure age and relative age, respectively. We present CpGs and enriched pathways that relate to different conditions such as chronological age, high altitude, and monogamous behavior. Overall, this study provides a first step towards studying the epigenetic correlates of monogamous behavior and adaptation to high altitude in Peromyscus. The human-Peromyscus epigenetic clocks are expected to provide a significant boost to the attractiveness of Peromyscus as a biological model.

Phylogeography Reveals Association between Swine Trade and the Spread of Porcine Epidemic Diarrhea Virus in China and across the World.


The ongoing SARS (severe acute respiratory syndrome)-CoV (coronavirus)-2 pandemic has exposed major gaps in our knowledge on the origin, ecology, evolution, and spread of animal coronaviruses. Porcine epidemic diarrhea virus (PEDV) is a member of the genus Alphacoronavirus in the family Coronaviridae that may have originated from bats and leads to significant hazards and widespread epidemics in the swine population. The role of local and global trade of live swine and swine-related products in disseminating PEDV remains unclear, especially in developing countries with complex swine production systems. Here, we undertake an in-depth phylogeographic analysis of PEDV sequence data (including 247 newly sequenced samples) and employ an extension of this inference framework that enables formally testing the contribution of a range of predictor variables to the geographic spread of PEDV. Within China, the provinces of Guangdong and Henan were identified as primary hubs for the spread of PEDV, for which we estimate live swine trade to play a very important role. On a global scale, the United States and China maintain the highest number of PEDV lineages. We estimate that, after an initial introduction out of China, the United States acted as an important source of PEDV introductions into Japan, Korea, China, and Mexico. Live swine trade also explains the dispersal of PEDV on a global scale. Given the increasingly global trade of live swine, our findings have important implications for designing prevention and containment measures to combat a wide range of livestock coronaviruses.

Statistics or biology: the zero-inflation controversy about scRNA-seq data.


Researchers view vast zeros in single-cell RNA-seq data differently: some regard zeros as biological signals representing no or low gene expression, while others regard zeros as missing data to be corrected. To help address the controversy, here we discuss the sources of biological and non-biological zeros; introduce five mechanisms of adding non-biological zeros in computational benchmarking; evaluate the impacts of non-biological zeros on data analysis; benchmark three input data types: observed counts, imputed counts, and binarized counts; discuss the open questions regarding non-biological zeros; and advocate the importance of transparent analysis.

DNA methylation aging and transcriptomic studies in horses.


Cytosine methylation patterns have not yet been thoroughly studied in horses. Here, we profile n = 333 samples from 42 horse tissue types at loci that are highly conserved between mammalian species using a custom array (HorvathMammalMethylChip40). Using the blood and liver tissues from horses, we develop five epigenetic aging clocks: a multi-tissue clock, a blood clock, a liver clock and two dual-species clocks that apply to both horses and humans. In addition, using blood methylation data from three additional equid species (plains zebra, Grevy's zebras and Somali asses), we develop another clock that applies across all equid species. Castration does not significantly impact the epigenetic aging rate of blood or liver samples from horses. Methylation and RNA data from the same tissues define the relationship between methylation and RNA expression across horse tissues. We expect that the multi-tissue atlas will become a valuable resource.

Efficient Algorithms and Implementation of a Semiparametric Joint Model for Longitudinal and Competing Risk Data: With Applications to Massive Biobank Data.


Semiparametric joint models of longitudinal and competing risk data are computationally costly, and their current implementations do not scale well to massive biobank data. This paper identifies and addresses some key computational barriers in a semiparametric joint model for longitudinal and competing risk survival data. By developing and implementing customized linear scan algorithms, we reduce the computational complexities from O(n^2) or O(n^3) to O(n) in various steps including numerical integration, risk set calculation, and standard error estimation, where n is the number of subjects. Using both simulated and real-world biobank data, we demonstrate that these linear scan algorithms can speed up the existing methods by a factor of up to hundreds of thousands when n > 10^4, often reducing the runtime from days to minutes. We have developed an R package, FastJM, based on the proposed algorithms for joint modeling of longitudinal and competing risk time-to-event data and made it publicly available on the Comprehensive R Archive Network (CRAN).
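
To illustrate the general idea behind a linear scan for risk set calculation, the following Python sketch compares a naive quadratic computation of risk-set sums with a single backward pass over subjects sorted by event time. It is not the FastJM implementation, which the abstract does not describe in detail; the function names and the toy weights are hypothetical, and the one-time sort adds an O(n log n) step that is assumed to be done once and reused.

```python
def risk_set_sums_naive(times, weights):
    """O(n^2): for each subject, sum the weights of all subjects
    whose observed time is at least as large (the risk set)."""
    return [sum(w for t, w in zip(times, weights) if t >= ti) for ti in times]


def risk_set_sums_scan(times, weights):
    """One backward pass over subjects sorted by time.

    After sorting once, the risk-set sum for each subject is a running
    total accumulated from the largest time down to the smallest.
    Tied times are grouped so that tied subjects share the same sum.
    """
    n = len(times)
    order = sorted(range(n), key=lambda i: times[i])  # ascending by time
    out = [0.0] * n
    running = 0.0
    i = n - 1
    while i >= 0:
        # Add every subject tied at the current time to the running total.
        k = i
        while k >= 0 and times[order[k]] == times[order[i]]:
            running += weights[order[k]]
            k -= 1
        # All subjects in the tied block share the same risk-set sum.
        for m in range(k + 1, i + 1):
            out[order[m]] = running
        i = k
    return out


# Hypothetical toy data: observed times and per-subject weights.
times = [5, 3, 8, 3, 10]
weights = [1, 2, 3, 4, 5]
naive_result = risk_set_sums_naive(times, weights)
scan_result = risk_set_sums_scan(times, weights)
```

Both functions return the same sums; only the scan version avoids revisiting every subject for each risk set.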

Epigenetic models developed for plains zebras predict age in domestic horses and endangered equids.


Effective conservation and management of threatened wildlife populations require an accurate assessment of age structure to estimate demographic trends and population viability. Epigenetic aging models are promising developments because they estimate individual age with high accuracy, accurately predict age in related species, and do not require invasive sampling or intensive long-term studies. Using blood and biopsy samples from known age plains zebras (Equus quagga), we model epigenetic aging using two approaches: the epigenetic clock (EC) and the epigenetic pacemaker (EPM). The plains zebra EC has the potential for broad application within the genus Equus given that five of the seven extant wild species of the genus are threatened. We test the EC's ability to predict age in sister taxa, including two endangered species and the more distantly related domestic horse, demonstrating high accuracy in all cases. By comparing chronological and estimated age in plains zebras, we investigate age acceleration as a proxy of health status. An interaction between chronological age and inbreeding is associated with age acceleration estimated by the EPM, suggesting a cumulative effect of inbreeding on biological aging throughout life.


On identifiability and consistency of the nugget in Gaussian spatial process models


Spatial process models popular in geostatistics often represent the observed data as the sum of a smooth underlying process and white noise. The variation in the white noise is attributed to measurement error, or micro-scale variability, and is called the “nugget”. We formally establish results on the identifiability and consistency of the nugget in spatial models based upon the Gaussian process within the framework of in-fill asymptotics, i.e. the sample size increases within a sampling domain that is bounded. Our work extends results in fixed domain asymptotics for spatial models without the nugget. More specifically, we establish the identifiability of parameters in the Matérn covariogram and the consistency of their maximum likelihood estimators in the presence of discontinuities due to the nugget. We also present simulation studies to demonstrate the role of the identifiable quantities in spatial interpolation.
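
As a rough illustration of how the nugget enters such a model, the Python sketch below builds the covariance matrix of observations on a one-dimensional domain as an exponential covariance (the Matérn covariogram with smoothness 1/2) plus a nugget on the diagonal. The function name, the parameter names `sigma2`, `phi`, and `tau2`, and the numeric values are assumptions made for this example and are not taken from the paper.

```python
import math


def exp_covariance_with_nugget(coords, sigma2, phi, tau2):
    """Covariance matrix for y(s) = w(s) + e(s) on a 1-D domain.

    w(s) has exponential covariance sigma2 * exp(-phi * d), where d is
    the distance between locations; e(s) is white noise with variance
    tau2 (the nugget), which contributes only at distance zero.
    """
    n = len(coords)
    cov = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            d = abs(coords[i] - coords[j])
            cov[i][j] = sigma2 * math.exp(-phi * d)
            if i == j:
                cov[i][j] += tau2  # discontinuity at the origin
    return cov


# Hypothetical locations in a bounded domain and illustrative parameters.
coords = [0.0, 0.5, 1.0]
cov = exp_covariance_with_nugget(coords, sigma2=2.0, phi=1.5, tau2=0.3)
```

The diagonal entries equal `sigma2 + tau2`, while off-diagonal entries approach `sigma2` rather than `sigma2 + tau2` as the distance shrinks; this jump at distance zero is the discontinuity due to the nugget.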