A central task in statistical analyses of infectious disease surveillance data is nowcasting transmission dynamics, that is, understanding how transmissible a pathogen is in the present day. One way to summarize transmissibility is through the effective reproduction number: the average number of individuals an individual infected today would subsequently infect under current conditions. When the effective reproduction number is above one, an outbreak is expected to grow; when it is below one, the outbreak is expected to shrink. Estimating the effective reproduction number from observed data is non-trivial, as epidemics are only ever partially observed, and existing data streams are subject to ascertainment biases that must be taken into account. Ideally, epidemics would be modeled as partially observed stochastic processes, but in practice this is computationally prohibitive. In this dissertation, we develop statistical models for estimating the effective reproduction number from a variety of data sources using a series of computationally tractable approximate models of epidemics. In particular, we develop models for estimating the effective reproduction number from case and test data, from pathogen genome concentrations collected from wastewater in large populations, and from pathogen genome concentrations collected from wastewater in small populations. We compare our methods against state-of-the-art methods in simulation studies, and we apply our methods to estimate the effective reproduction number of SARS-CoV-2 in California from 2020 to 2022.
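The threshold behavior of the effective reproduction number can be illustrated with the renewal equation, a standard approximate epidemic model that links reproduction numbers to expected incidence. The sketch below is illustrative only; the generation-interval weights and all numbers are hypothetical, not taken from the dissertation's models.

```python
# Illustrative sketch (hypothetical numbers, not the dissertation's models):
# the renewal equation sets expected incidence to R times a weighted sum of
# recent incidence, I_t = R * sum_s w_s * I_{t-s}, where w is the
# generation-interval distribution.
def renewal_incidence(r, n_days, w=(0.25, 0.5, 0.25), seed_cases=100.0):
    """Simulate expected daily incidence under a constant reproduction
    number r and generation-interval weights w (must sum to 1)."""
    incidence = [seed_cases]
    for t in range(1, n_days):
        infectious_pressure = sum(
            w_s * incidence[t - s - 1]
            for s, w_s in enumerate(w)
            if t - s - 1 >= 0
        )
        incidence.append(r * infectious_pressure)
    return incidence

# Above one, incidence grows; below one, it decays toward zero.
growing = renewal_incidence(r=1.5, n_days=30)
shrinking = renewal_incidence(r=0.7, n_days=30)
```

With identical seeding, only the value of `r` relative to one determines whether the simulated outbreak expands or contracts.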
Statistical modeling of infectious disease data is among the oldest applications of statistics. Today, it is an increasingly relevant area of research, owing to globalization, which enables diseases to spread further and faster, and to the abundance of relevant data from electronic surveillance systems, seroprevalence studies, and genetic sequencing of pathogens. In this work, we develop novel statistical methods that combine varied data sources to improve both inference and forecasting. First, we use data from assay validation studies and active surveillance studies to develop confidence intervals for prevalence estimates from complex surveys with imperfect assays. In this complicated setting there are no established competing methods, and ours exhibits at least nominal coverage. In addition, we apply our method in simplified cases where competitors do exist and demonstrate its desirable properties. Next, we develop a semi-parametric Bayesian compartmental model that effectively integrates passively collected time series of diagnostic tests and mortality data, as well as actively collected seroprevalence data. We emphasize retrospective inference and evaluate the utility of each data stream in the context of short-term forecasting. Finally, we focus on healthcare demand forecasting during epidemic surges of pathogen variants capable of immune escape. We build upon our Bayesian compartmental model to incorporate time series of cases, hospitalizations, ICU admissions, deaths, and genetic sequence counts. We show that using genetic information leads to superior forecasting performance compared to traditional models. Throughout each project, we apply our methods to a variety of COVID-19 data sets at the county, state, and national levels.
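For intuition on why imperfect assays bias prevalence estimates, the classical Rogan-Gladen correction adjusts the apparent (test-positive) prevalence using the assay's sensitivity and specificity. This is the standard textbook point estimator, not the confidence interval methodology developed in the dissertation, and the numbers below are hypothetical.

```python
# Classical Rogan-Gladen correction (standard estimator, not the
# dissertation's method); all inputs below are hypothetical.
def rogan_gladen(apparent_prev, sensitivity, specificity):
    """Correct apparent prevalence for assay error:
    true_prev = (apparent + spec - 1) / (sens + spec - 1),
    truncated to [0, 1]."""
    corrected = (apparent_prev + specificity - 1.0) / (
        sensitivity + specificity - 1.0
    )
    return min(max(corrected, 0.0), 1.0)

# E.g., 8% of assays positive with 90% sensitivity and 95% specificity:
est = rogan_gladen(0.08, sensitivity=0.90, specificity=0.95)
```

Note that when the apparent prevalence falls below the false-positive rate (here, below 5%), the uncorrected estimator would go negative, which is why the truncation to [0, 1] matters and why interval estimation in this setting is delicate.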
Hematopoiesis is the complex mechanism by which hematopoietic stem cells produce a variety of functional blood cells through multiple stages of differentiation. Since the numbers of various blood cell types need to be maintained in homeostasis, with occasional short-lived departures from it, hematopoiesis must have multiple regulatory mechanisms; however, these are still not fully understood. Although many mathematical models of hematopoiesis regulation have been proposed, more work is needed on methods for fitting and interpreting experimental data that integrate statistical and mechanistic models. Here, using a new chemical reaction ordinary differential equation (ODE) model of negative feedback regulation in hematopoiesis, we develop a scalable, hierarchical Bayesian framework that uses a latent variables approach, takes heterogeneity across mice into account, and infers division, differentiation, and feedback regulation parameters of hematopoietic cells. We designed and performed an experiment in which mice were injected with the chemotherapy drug 5-FU, which reduces the number of stem and progenitor cells by blocking DNA synthesis and repair, to perturb the hematopoietic equilibrium. Counting the number of cells in the bone marrow (BM) requires sacrificing the mouse, so each mouse can contribute cell count data at one time point only. To work with these partially observed datasets, we use the ODE model to interpolate the noisy means of the experimental cell count data, inferring the missing data. We evaluate the performance of the new model and inferential framework using synthetic data and find that we are able to distinguish between models that account for biological variation and models that include only technical variation (measurement error). We find that the experimental data are best described by a hierarchical model in which the hematopoiesis model parameters are allowed to vary among mice, suggesting the presence of significant biological variability.
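To illustrate the kind of dynamics such negative-feedback ODE models produce, the sketch below integrates a toy two-compartment model in which the stem-cell self-renewal rate is suppressed by the mature-cell count. The model structure, rates, and initial conditions are hypothetical illustrations, not the dissertation's actual model.

```python
# Toy negative-feedback model (hypothetical rates and structure, not the
# dissertation's model): stem cells S self-renew at a rate damped by the
# mature-cell count M, differentiate into M at rate q, and mature cells
# die at rate d:
#     dS/dt = (p / (1 + k*M)) * S - q * S
#     dM/dt = q * S - d * M
# Steady state: M* = (p/q - 1)/k and S* = d*M*/q; here S* = M* = 100.
def simulate_feedback(s0=20.0, m0=100.0, dt=0.01, n_steps=20000,
                      p=1.0, q=0.5, k=0.01, d=0.5):
    """Integrate the toy feedback model with forward Euler and return the
    final (S, M) state."""
    s, m = s0, m0
    for _ in range(n_steps):
        ds = (p / (1.0 + k * m)) * s - q * s
        dm = q * s - d * m
        s, m = s + dt * ds, m + dt * dm
    return s, m

# Starting from a depleted stem-cell pool (loosely analogous to the state
# after 5-FU), the feedback returns the system to its equilibrium.
s_eq, m_eq = simulate_feedback()
```

In this toy model the Jacobian at equilibrium has complex eigenvalues with negative real part, so the return to equilibrium occurs through damped oscillations rather than a monotone approach.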
Our experimental data and the model show that, after perturbation, hematopoiesis returns to equilibrium via damped oscillations, with a notable overshoot of the depleted cell counts shortly after the system is perturbed from equilibrium. We then explore an alternative way of accounting for data heterogeneity by employing stochastic differential equations instead of letting division and feedback regulation parameters vary across mice. Computational tractability of the likelihood in a Bayesian inference framework is achieved with the linear noise approximation (LNA) derived from the chemical Langevin equation, which lets us approximate the joint posterior density of the hematopoietic rate parameters and the missing data. We evaluate the performance of the new Bayesian LNA framework and compare it to the Bayesian ODE frameworks we developed previously. We find that the new framework can further improve out-of-sample prediction, as indicated by leave-one-out cross-validation. We identify limitations of inference for our LNA model when multiple sources of biological and technical variation in the dataset are significant, and we develop a procedure for overcoming them. Finally, we investigate experimental designs that optimize the amount of information gained about the model parameters and missing data. We employ a new adversarial approach that uses a game theory framework for experimental design without requiring calculation of posterior probability distributions. This allows us to avoid the cost of traditional Bayesian optimal design methodology, which requires repeated approximations of posterior distributions that are expensive to generate and prohibitively costly for high-dimensional models.
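For intuition on the chemical Langevin equation that underlies the LNA, the sketch below simulates a toy immigration-death process with Euler-Maruyama and checks that the sample mean settles near the deterministic ODE mean, which is the trajectory the LNA linearizes fluctuations around. The process, rates, and discretization choices are hypothetical illustrations, not the hematopoiesis model.

```python
import random

# Euler-Maruyama simulation of the chemical Langevin equation for a toy
# immigration-death process (hypothetical rates, not the hematopoiesis
# model): dX = (b - d*X) dt + sqrt(b + d*X) dW.
# The LNA approximates these fluctuations as Gaussian noise around the
# deterministic mean, which relaxes to b/d.
def cle_immigration_death(b=50.0, d=0.5, x0=0.0, dt=0.01,
                          n_steps=2000, n_paths=200, seed=1):
    """Simulate n_paths trajectories and return the mean final state."""
    rng = random.Random(seed)
    finals = []
    for _ in range(n_paths):
        x = x0
        for _ in range(n_steps):
            drift = b - d * x
            noise = max(b + d * x, 0.0) ** 0.5  # diffusion coefficient
            x += drift * dt + noise * rng.gauss(0.0, dt ** 0.5)
        finals.append(x)
    return sum(finals) / len(finals)

# Averaging over paths, the final state sits near the ODE mean b/d = 100,
# with Gaussian-looking spread across paths -- the regime the LNA exploits.
mean_final = cle_immigration_death()
```

Because the diffusion term depends on the state, the exact likelihood of such a model is intractable; linearizing the noise around the deterministic path is what makes the LNA likelihood computable in a Bayesian framework.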