Open Access Publications from the University of California

Department of Biostatistics

Open Access Policy Deposits

This series is automatically populated with publications deposited by UCLA Fielding School of Public Health Department of Biostatistics researchers in accordance with the University of California’s open access policies. For more information see Open Access Policy Deposits and the UC Publication Management System.

scDesign3 generates realistic in silico data for multimodal single-cell and spatial omics.


We present a statistical simulator, scDesign3, to generate realistic single-cell and spatial omics data, including various cell states, experimental designs and feature modalities, by learning interpretable parameters from real data. Using a unified probabilistic model for single-cell and spatial omics data, scDesign3 infers biologically meaningful parameters; assesses the goodness-of-fit of inferred cell clusters, trajectories and spatial locations; and generates in silico negative and positive controls for benchmarking computational tools.

Insights into the accuracy of social scientists' forecasts of societal change.


How well can social scientists predict societal change, and what processes underlie their predictions? To answer these questions, we ran two forecasting tournaments testing the accuracy of predictions of societal change in domains commonly studied in the social sciences: ideological preferences, political polarization, life satisfaction, sentiment on social media, and gender-career and racial bias. After we provided them with historical trend data on the relevant domain, social scientists submitted pre-registered monthly forecasts for a year (Tournament 1; N = 86 teams and 359 forecasts), with an opportunity to update forecasts on the basis of new data six months later (Tournament 2; N = 120 teams and 546 forecasts). Benchmarking forecasting accuracy revealed that social scientists' forecasts were on average no more accurate than those of simple statistical models (historical means, random walks or linear regressions) or the aggregate forecasts of a sample from the general public (N = 802). However, scientists were more accurate if they had scientific expertise in a prediction domain, were interdisciplinary, used simpler models and based predictions on prior data.
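The three simple statistical benchmarks named above (historical mean, random walk, linear regression) can be sketched as follows. This is a minimal illustration rather than the tournament's actual scoring code: the function names, the toy series, and the use of mean absolute error as the accuracy measure are all assumptions made for the example.

```python
from statistics import mean


def historical_mean_forecast(history, horizon):
    """Predict the historical average for every future step."""
    return [mean(history)] * horizon


def random_walk_forecast(history, horizon):
    """Predict that the last observed value persists (no drift)."""
    return [history[-1]] * horizon


def linear_trend_forecast(history, horizon):
    """Fit y = intercept + slope * t by least squares and extrapolate."""
    n = len(history)
    t_bar = (n - 1) / 2
    y_bar = mean(history)
    sxx = sum((t - t_bar) ** 2 for t in range(n))
    sxy = sum((t - t_bar) * (y - y_bar) for t, y in enumerate(history))
    slope = sxy / sxx
    intercept = y_bar - slope * t_bar
    return [intercept + slope * (n + h) for h in range(horizon)]


def mean_absolute_error(forecast, actual):
    """Average absolute gap between forecast and outcome (assumed metric)."""
    return mean(abs(f - a) for f, a in zip(forecast, actual))


# Hypothetical monthly series, invented purely for illustration.
history = [50.0, 51.0, 49.5, 52.0, 52.5, 53.0]
actual = [53.5, 54.0, 53.0]

for name, fn in [
    ("historical mean", historical_mean_forecast),
    ("random walk", random_walk_forecast),
    ("linear trend", linear_trend_forecast),
]:
    forecast = fn(history, len(actual))
    print(name, round(mean_absolute_error(forecast, actual), 3))
```

A forecaster would then be judged by whether their submitted forecasts achieve a lower error than each of these baselines on the same held-out months.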

Genomic surveillance reveals dynamic shifts in the connectivity of COVID-19 epidemics


Summary: The maturation of genomic surveillance in the past decade has enabled tracking of the emergence and spread of epidemics at an unprecedented level. During the COVID-19 pandemic, for example, genomic data revealed that local epidemics varied considerably in the frequency of SARS-CoV-2 lineage importation and persistence, likely due to a combination of COVID-19 restrictions and changing connectivity. Here, we show that local COVID-19 epidemics are driven by regional transmission, including across international boundaries, but can become increasingly connected to distant locations following the relaxation of public health interventions. By integrating genomic, mobility, and epidemiological data, we find abundant transmission occurring between both adjacent and distant locations, supported by dynamic mobility patterns. We find that changing connectivity significantly influences local COVID-19 incidence. Our findings demonstrate a complex meaning of ‘local’ when investigating connected epidemics and emphasize the importance of collaborative interventions for pandemic prevention and mitigation.

Bayesian Hierarchical Modeling and Analysis for Actigraph Data From Wearable Devices


The majority of Americans fail to achieve recommended levels of physical activity, which leads to numerous preventable health problems such as diabetes, hypertension, and heart disease. This has generated substantial interest in monitoring human activity to gear interventions toward environmental features that may relate to higher physical activity. Wearable devices, such as wrist-worn sensors that monitor gross motor activity (actigraph units), continuously record the activity levels of a subject, producing massive amounts of high-resolution measurements. Analyzing actigraph data requires accounting for spatial and temporal information on trajectories or paths traversed by subjects wearing such devices. Inferential objectives include estimating a subject's physical activity levels along a given trajectory; identifying trajectories that are more likely to produce higher levels of physical activity for a given subject; and predicting expected levels of physical activity in any proposed new trajectory for a given set of health attributes. Here, we devise a Bayesian hierarchical modeling framework for spatial-temporal actigraphy data to deliver fully model-based inference on trajectories while accounting for subject-level health attributes and spatial-temporal dependencies. We undertake a comprehensive analysis of an original dataset from the Physical Activity through Sustainable Transport Approaches in Los Angeles (PASTA-LA) study to ascertain spatial zones and trajectories exhibiting significantly higher levels of physical activity while accounting for various sources of heterogeneity.

Objective response rate (ORR) targets for recurrent glioblastoma clinical trials based on the historic association between ORR and median overall survival.


Durable objective response rate (ORR) remains a meaningful endpoint in recurrent cancer; however, the target ORR for single-arm recurrent glioblastoma trials has not been based on historic information or tied to patient outcomes. The current study reviewed 68 treatment arms comprising 4,793 patients in past trials in recurrent glioblastoma in order to judiciously define target ORRs for use in recurrent glioblastoma trials. ORR was estimated at 6.1% (95% CI: 4.23%, 8.76%) for cytotoxic chemotherapies (ORR = 7.59% for CCNU, 7.57% for TMZ, 0.64% for CPT-11, and 5.32% for other agents), 3.37% for biologic agents, 7.97% for (select) immunotherapies, and 26.8% for anti-angiogenic agents. ORRs were significantly correlated with median overall survival (mOS) across chemotherapy (R² = 0.4078, P < 0.0001), biologics (R² = 0.4003, P = 0.0003), and immunotherapy trials (R² = 0.8994, P < 0.0001), but not anti-angiogenic agents (R² = 0, P = 0.8937). Pooling data from chemotherapy, biologics, and immunotherapy trials, a meta-analysis indicated a strong correlation between ORR and mOS (R² = 0.3900, P < 0.0001; mOS [weeks] = 1.4 × ORR + 24.8). Assuming an ineffective cytotoxic (control) therapy has ORR = 7.6%, the average ORR for lomustine and temozolomide trials, a sample size of ≥40 patients with target ORR >25% is needed to demonstrate statistical significance compared to control with a high level of confidence (P < 0.01) and adequate power (>80%). Given these historic data and potential biases in patient selection, we recommend that well-controlled, single-arm phase II studies in recurrent glioblastoma should have a target ORR >25% (which translates to a median OS of approximately 15 months) and a sample size of ≥40 patients, in order to convincingly demonstrate antitumor activity. Crucially, this response needs to have sufficient durability, which was not addressed in the current study.
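The pooled regression line reported above can be evaluated directly. The sketch below only plugs ORR values quoted in the abstract into that fitted line; the function name is hypothetical, and the result is a point prediction in weeks that ignores the uncertainty of the fit.

```python
def predicted_mos_weeks(orr_percent):
    """Pooled fit from the abstract: mOS [weeks] = 1.4 * ORR + 24.8.

    orr_percent is the objective response rate in percent (e.g. 25 for 25%).
    """
    return 1.4 * orr_percent + 24.8


# ORR values taken from the abstract: assumed control rate and target rate.
for label, orr in [("control (7.6%)", 7.6), ("target (25%)", 25.0)]:
    print(f"{label}: {predicted_mos_weeks(orr):.1f} weeks")
```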

Plasma proteome perturbation for CMV DNAemia in kidney transplantation.



Cytomegalovirus (CMV) infection, either de novo or as reactivation after allotransplantation and chronic immunosuppression, is recognized to cause detrimental alloimmune effects, inclusive of higher susceptibility to graft rejection and substantive impact on chronic graft injury and reduced transplant survival. To obtain further insights into the evolution and pathogenesis of CMV infection in an immunocompromised host we evaluated changes in the circulating host proteome serially, before and after transplantation, and during and after CMV DNA replication (DNAemia), as measured by quantitative polymerase chain reaction (QPCR).


LC-MS-based proteomics was conducted on 168 serially banked plasma samples from 62 propensity score-matched kidney transplant recipients. Patients were stratified by CMV replication status into 31 with CMV DNAemia and 31 without CMV DNAemia. Patients had blood samples drawn at protocol times of 3 and 12 months post-transplant. Additionally, blood samples were drawn before and 1 week and 1 month after detection of CMV DNAemia. Plasma proteins were analyzed using an LCMS 8060 triple quadrupole mass spectrometer. Further, public transcriptomic data on time-matched PBMC samples from the same patients were used to evaluate integrative pathways. Data analysis was conducted using R and Limma.


Samples were segregated based on their proteomic profiles with respect to their CMV DNAemia status. A subset of 17 plasma proteins was observed to predict the onset of CMV at 3 months post-transplant, with enrichment in the platelet degranulation (FDR, 4.83E-06), acute inflammatory response (FDR, 0.0018), and blood coagulation (FDR, 0.0018) pathways. An increase in many immune complex proteins was observed at CMV infection. Prior to DNAemia the plasma proteome showed changes in the anti-inflammatory adipokine vaspin (SERPINA12), copper binding protein ceruloplasmin (CP), complement activation (FDR = 0.03), and proteins enriched in the humoral (FDR = 0.01) and innate immune responses (FDR = 0.01).


Plasma proteomic and transcriptional perturbations impacting humoral and innate immune pathways are observed during CMV infection and provide biomarkers for CMV disease prediction and resolution. Further studies to understand the clinical impact of these pathways can help in the formulation of different types and duration of anti-viral therapies for the management of CMV infection in the immunocompromised host.

Graphical Gaussian Process Models for Highly Multivariate Spatial Data.


For multivariate spatial Gaussian process (GP) models, customary specifications of cross-covariance functions do not exploit relational inter-variable graphs to ensure process-level conditional independence among the variables. This is undesirable, especially for highly multivariate settings, where popular cross-covariance functions such as the multivariate Matérn suffer from a "curse of dimensionality" as the number of parameters and floating point operations scale up in quadratic and cubic order, respectively, in the number of variables. We propose a class of multivariate "Graphical Gaussian Processes" using a general construction called "stitching" that crafts cross-covariance functions from graphs and ensures process-level conditional independence among variables. For the Matérn family of functions, stitching yields a multivariate GP whose univariate components are Matérn GPs, and conforms to process-level conditional independence as specified by the graphical model. For highly multivariate settings and decomposable graphical models, stitching offers massive computational gains and parameter dimension reduction. We demonstrate the utility of the graphical Matérn GP to jointly model highly multivariate spatial data using simulation examples and an application to air-pollution modelling.

Ingestible sensor system for measuring, monitoring and enhancing adherence to antiretroviral therapy: An open-label, usual care-controlled, randomised trial.



Antiretrovirals (ARVs) co-encapsulated with an ingestible sensor (IS) make it possible to monitor adherence in real time using a sensor patch, a mobile device, and supporting software. We evaluated the acceptability, effectiveness, and sustainability of the IS system with real-time text reminders.


Participants were recruited from HIV clinics in Los Angeles and were randomised 1:1 to the IS or usual care (UC) group. Adherence to ARVs (primary outcome) was measured by the IS system (IS group only), plasma ARV concentration, and self-report. IS-measured adherence was clustered by a group-based trajectory model and was validated by ARV concentration summarized by the integrated pharmacokinetic adherence measure (IPAM) score. HIV RNA viral load (VL) was compared between the IS and UC groups.


A total of 112 (IS = 54, UC = 58) participants who completed baseline with at least one follow-up data collection were included in analyses. Overall satisfaction rate for the IS system was >90%. The IPAM score was higher (0.018, 95% CI: -0.098 to 0.134, p = 0.75) and VL decayed faster (-0.020, 95% CI: -0.042 to 0.002, p = 0.08) in the IS group compared with the UC group. The ingestible sensor system was well tolerated by study participants.


The IS system was well accepted by participants and its use was associated with improved adherence and lower HIV RNA VL. The findings provide a potentially effective strategy for improving adherence.


This work was supported by grant R01-MH110056 from the National Institute of Mental Health (NIMH)/National Institutes of Health (NIH). Y. Wang was in part supported by the NIMH/NIH award T32MH080634. E. Daar was in part supported by the National Center for Advancing Translational Sciences through UCLA CTSI Grant UL1TR001881. The content is solely the responsibility of the authors and does not necessarily represent the official views of the NIH.

Information-theoretic Classification Accuracy: A Criterion that Guides Data-driven Combination of Ambiguous Outcome Labels in Multi-class Classification


Outcome labeling ambiguity and subjectivity are ubiquitous in real-world datasets. While practitioners commonly combine ambiguous outcome labels for all data points (instances) in an ad hoc way to improve the accuracy of multi-class classification, a principled approach that guides the label combination for all data points by an optimality criterion is lacking. To address this problem, we propose the information-theoretic classification accuracy (ITCA), a criterion that balances the trade-off between prediction accuracy (how well predicted labels agree with actual labels) and classification resolution (how many labels are predictable), to guide practitioners on how to combine ambiguous outcome labels. To find the optimal label combination indicated by ITCA, we propose two search strategies: greedy search and breadth-first search. Notably, ITCA and the two search strategies are adaptive to all machine-learning classification algorithms. Coupled with a classification algorithm and a search strategy, ITCA has two uses: improving prediction accuracy and identifying ambiguous labels. We first verify that ITCA achieves high accuracy with both search strategies in finding the correct label combinations on synthetic and real data. Then we demonstrate the effectiveness of ITCA in diverse applications, including medical prognosis, cancer survival prediction, user demographics prediction, and cell type classification. We also provide theoretical insights into ITCA by studying the oracle and the linear discriminant analysis classification algorithms. Python package itca (available at implements ITCA and the search strategies.

Context-specific emergence and growth of the SARS-CoV-2 Delta variant.


The SARS-CoV-2 Delta (Pango lineage B.1.617.2) variant of concern spread globally, causing resurgences of COVID-19 worldwide. The emergence of the Delta variant in the UK occurred on the background of a heterogeneous landscape of immunity and relaxation of non-pharmaceutical interventions. Here we analyse 52,992 SARS-CoV-2 genomes from England together with 93,649 genomes from the rest of the world to reconstruct the emergence of Delta and quantify its introduction to and regional dissemination across England in the context of changing travel and social restrictions. Using analysis of human movement, contact tracing and virus genomic data, we find that the geographic focus of the expansion of Delta shifted from India to a more global pattern in early May 2021. In England, Delta lineages were introduced more than 1,000 times and spread nationally as non-pharmaceutical interventions were relaxed. We find that hotel quarantine for travellers reduced onward transmission from importations; however, the transmission chains that later dominated the Delta wave in England were seeded before travel restrictions were introduced. Increasing inter-regional travel within England drove the nationwide dissemination of Delta, with some cities receiving more than 2,000 observable lineage introductions from elsewhere. Subsequently, increased levels of local population mixing, and not the number of importations, were associated with the faster relative spread of Delta. The invasion dynamics of Delta depended on spatial heterogeneity in contact patterns, and our findings will inform optimal spatial interventions to reduce the transmission of current and future variants of concern, such as Omicron (Pango lineage B.1.1.529).