Scholarly Works (168 results)

Thesis
Peer Reviewed

The LITSE Algorithm: Theory and Application

Hansen, Curt
Advisor(s): Hubbard, Alan

UC Berkeley Electronic Theses and Dissertations (2015)

In this dissertation, we present a novel method -- the Learning with Iteration and Tree-based Search Estimation algorithm -- for the estimation of the malarial haplotype composition in one or more individuals and the corresponding haplotype population frequencies, focusing in particular on the case where individuals have been infected by more than one strain. This estimation must take place in the presence of pooled readings of the genetic composition of the parasites present.

The approach consists of the combination of a parameterized tree-based combinatorial search and a refinement phase incorporating the Expectation Maximization algorithm. The EM algorithm is particularly attractive as it is structured to be applied to situations involving both observed and unobserved information.
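The role of EM with partly unobserved information can be illustrated with a much simpler problem than the one the dissertation addresses. The sketch below is not the LITSE algorithm: it assumes each sample carries exactly one haplotype, and that a reading only narrows it down to a set of compatible haplotypes. The function name `em_haplotype_frequencies` and the input layout are hypothetical, chosen for illustration.

```python
def em_haplotype_frequencies(observations, haplotypes, n_iter=200, tol=1e-10):
    """Estimate haplotype frequencies when some readings are ambiguous.

    observations: list of non-empty sets; each set holds the haplotypes
                  compatible with one reading (assumed single-strain sample).
    haplotypes:   list of all candidate haplotype labels.
    """
    # Start from uniform frequencies.
    freqs = {h: 1.0 / len(haplotypes) for h in haplotypes}
    for _ in range(n_iter):
        # E-step: split each reading across its compatible haplotypes
        # in proportion to the current frequency estimates.
        expected = {h: 0.0 for h in haplotypes}
        for compatible in observations:
            total = sum(freqs[h] for h in compatible)
            for h in compatible:
                expected[h] += freqs[h] / total
        # M-step: new frequencies are the normalized expected counts.
        new_freqs = {h: expected[h] / len(observations) for h in haplotypes}
        delta = max(abs(new_freqs[h] - freqs[h]) for h in haplotypes)
        freqs = new_freqs
        if delta < tol:
            break
    return freqs


if __name__ == "__main__":
    # Six readings resolve to A, two to B, two are compatible with either.
    readings = [{"A"}] * 6 + [{"B"}] * 2 + [{"A", "B"}] * 2
    print(em_haplotype_frequencies(readings, ["A", "B"]))
```

In this toy setting the ambiguous readings carry no information about the frequencies, so the estimate settles at the proportion seen among the resolved readings.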

A test of an implementation of the algorithm on simulated data demonstrates its effectiveness in accurately estimating the haplotype compositions, both prior to and following the refinement. Its effectiveness established, the algorithm is then applied to a set of laboratory-produced malarial strain data.

In addition, the algorithm has been made available to other researchers through a dedicated website that accepts submissions and allows results to be downloaded.

While the current research focused on the application of the method to malarial parasites, the method is general enough to be applied to cases of infection by other organisms.

Finally, the dissertation presents several suggestions for future work in enhancing the algorithm both computationally and statistically and extending its scope to related research topics.

Thesis
Peer Reviewed

Semiparametric Prediction, Variable Importance, and Effect Estimation in Critical Care

Decker, Anna
Advisor(s): Hubbard, Alan E

UC Berkeley Electronic Theses and Dissertations (2014)

Trauma injury is one of the leading causes of death in the United States, accounting for over 120,000 deaths in 2010 according to the CDC. Understanding the underlying mechanisms and improving the treatment of trauma are of great clinical and public health interest. The systematic collection and study of critical care data originated in combat conflicts and wars and has more recently extended to civilian centers. Improving patient outcomes and the quality of care received, and identifying high-risk patients, are unmet needs in this field.

Clinicians rely on their intuition, training, and heuristic scoring systems to identify patients who are likely to die or experience other outcomes such as the need for a massive transfusion, which resuscitates the patient via the infusion of blood products such as plasma, platelets, and red blood cells. We assessed the ability of measured covariates to predict various clinical outcomes, demonstrated the utility of machine-learning prediction algorithms, and examined the predictive performance of a commonly-used score to predict massive transfusion. This highlights the need for a principled approach to predicting outcomes that does not rely only on ad hoc procedures.

In addition to the prediction of clinical outcomes, we defined a measure of variable importance for ranking predictors based on their relationship with the outcome of interest. This parameter was motivated by causal inference and requires a systematic approach to the question of interest that helps translate it into a parameter with a clinically meaningful interpretation and maintains transparency about the assumptions required to deem the parameter a causal effect. We apply this procedure to gene expression data from critically injured patients to illuminate how the coagulation and inflammation pathways react to trauma injury.

Finally, we compare the quality of care received at different trauma center types around the United States using another parameter motivated by causal inference. This allowed us to simulate what would have happened to a patient if they had been treated at a different trauma center and obtain an objective comparison that identified sites where severely injured patients would benefit most from being treated.

This research highlights the utility of causal inference for framing problems, motivating clinically meaningful statistical parameters, and interpreting the results. We also advocate for the use of semiparametric prediction algorithms to allow for greater flexibility in modeling assumptions and demonstrate their performance in practice.

Thesis
Peer Reviewed

Causal Inference and Prediction in Health Studies: Environmental Exposures and Schistosomiasis, HIV-1 Genotypic Susceptibility Scores and Virologic Suppression, and Risk of Hospital Readmission for Heart Failure Patients

Sudat, Sylvia
Advisor(s): Hubbard, Alan E

UC Berkeley Electronic Theses and Dissertations (2012)

Causal inference-inspired semi-parametric methods of measuring variable importance are well designed to answer questions of interest in health settings. Unlike traditional regression approaches, such variable importance measures are based on causal parameters that have straightforward real-world definitions, regardless of the approach used to estimate them. Parameters of regression models, in contrast, are not at all straightforward to interpret in real-world settings, because their definition relies completely on the correctness of the pre-specified model. Prediction-focused machine learning methods can avoid the issues of model pre-specification, but still do not provide estimates of variable importance that can be easily interpreted; the set of predictors chosen can also be highly variable. Semi-parametric methods combine the best of both approaches, and are able to utilize data-adaptive estimation algorithms while still returning a parameter estimate that is meaningful and can be simply understood.

In this dissertation, semi-parametric methods to assess variable importance are applied to three real-world health applications: the relationship between types of water contact and the prevalence of schistosomiasis infection in rural China; HIV-1 treatment regimen genotype susceptibility scores and their relationship with the rate of virologic suppression; and the impact of a telemanagement program on, and the association of multiple risk factors with, the rates of hospital readmission for heart failure patients. Emphasized are (1) the choice of parameter of interest as motivated by the research question, (2) estimator choice based on a consideration of theoretical properties and performance under non-ideal conditions, and (3) the use during the estimation process of machine learning algorithms and algorithms that utilize multiple candidate models. Four different causal parameters are defined and described, and multiple estimators are considered.

Each data analysis presents different opportunities to investigate aspects of causal inference-based semi-parametric methods. In the schistosomiasis analysis, a traditional regression approach is compared with semi-parametric methods. Estimator performance is compared in the HIV analysis, particularly in the context of the observed extreme violations of the experimental treatment assignment (ETA) assumption. The G-computation estimator, the inverse-probability-of-censoring-weighted (IPCW) estimator, its double-robust counterpart (DR-IPCW), and the targeted maximum likelihood estimator (TMLE) are included in this comparison. The heart failure analysis addresses differences in causal parameter definition for a community-level treatment, and the related assumptions that must be added to the typical theoretical framework. Also included in this analysis is a comparison of super learning with traditional regression in terms of predictive performance.
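As a rough illustration of the simplest estimator named above, the following sketch computes a G-computation estimate of an average treatment effect for a binary treatment and a single discrete confounder. It is a minimal sketch under stated assumptions, not the dissertation's analysis: the outcome regression is a plain stratum mean, and the function name `g_computation_ate` and the `(w, a, y)` record layout are hypothetical.

```python
from collections import defaultdict


def g_computation_ate(records):
    """G-computation estimate of E[Y(1)] - E[Y(0)].

    records: list of (w, a, y) tuples with a discrete confounder w,
             a binary treatment a (0 or 1), and a numeric outcome y.
    """
    # Step 1: estimate Q(a, w) = E[Y | A = a, W = w] by stratum means.
    sums = defaultdict(float)
    counts = defaultdict(int)
    for w, a, y in records:
        sums[(a, w)] += y
        counts[(a, w)] += 1

    def q(a, w):
        if counts[(a, w)] == 0:
            # No one with W = w received treatment a: the ETA (positivity)
            # assumption fails and the stratum mean is undefined.
            raise ValueError(f"no observations with a={a}, w={w}")
        return sums[(a, w)] / counts[(a, w)]

    # Step 2: average the predicted outcomes under each treatment level
    # over the empirical distribution of W.
    n = len(records)
    mean_treated = sum(q(1, w) for w, _, _ in records) / n
    mean_control = sum(q(0, w) for w, _, _ in records) / n
    return mean_treated - mean_control


if __name__ == "__main__":
    data = [
        (0, 1, 1), (0, 1, 0), (0, 0, 0), (0, 0, 0),
        (1, 1, 1), (1, 0, 1), (1, 0, 0),
    ]
    print(g_computation_ate(data))
```

The explicit error for an empty stratum shows where an ETA violation bites: with no treated (or untreated) observations at some confounder level, the estimator has nothing to average.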

Thesis
Peer Reviewed

Small Sample Inference

Gerlovina, Inna
Advisor(s): Hubbard, Alan E

UC Berkeley Electronic Theses and Dissertations (2016)

Multiple comparisons and small sample size, common characteristics of many types of "Big Data" including those that are produced by genomic studies, present specific challenges that affect reliability of inference. Use of multiple testing procedures necessitates estimation of very small tail probabilities and thus approximation of distal tails of a test statistic distribution. Results based on large deviation theory provide a formal condition that is necessary to guarantee error rate control given practical sample sizes, linking the number of tests and the sample size; this condition, however, is rarely satisfied. Using methods that are based on Edgeworth expansions (relying especially on the work of Peter Hall), we explore what it might translate into in terms of actual error rates. Our investigation illustrates how far the actual error rates can be from the declared nominal levels, indicating poor error rate control.

Edgeworth expansions, providing higher order approximations to the sampling distribution, also offer a promising direction for data analysis that could ameliorate the situation. In Chapter 1, we derive generalized expansions for studentized mean-based statistics that incorporate ordinary and moderated one- and two-sample t-statistics as well as the Welch t-test. Fifth-order expansions are generated with our developed software that can be used to produce expansions of an arbitrary order. In Chapter 2, we propose a data analysis method based on these expansions that includes a tail diagnostic procedure and a small sample adjustment. Using the software algorithm developed for generating expansions, we also obtain results for unbiased moment estimation of a general order. Chapter 3 introduces a general linear combination (GLC) bootstrap, which is specifically tailored for small sample size. A stabilized variance version of the GLC bootstrap, based on an empirical Bayes approach, is developed for high-dimensional data. Applying these methods to clustering, we propose an inferential procedure that produces pairwise clustering probabilities.
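For orientation, the standard one-term Edgeworth expansion for the ordinary one-sample t-statistic is shown below. This is the textbook result, not the generalized fifth-order expansions derived in the dissertation; the symbols $T_n$, $\gamma$, $\Phi$, and $\phi$ are notation assumed here.

```latex
% T_n   : one-sample t-statistic based on n i.i.d. observations
% gamma : skewness of the underlying distribution
% Phi, phi : standard normal CDF and density
P(T_n \le x) = \Phi(x) + \frac{\gamma}{6\sqrt{n}}\,\bigl(2x^{2} + 1\bigr)\,\phi(x) + O\!\left(n^{-1}\right)
```

The correction term shrinks only at rate $n^{-1/2}$ and grows with $x^2$ relative to the normal density. For skewed data and small $n$, this is one way to see why far-tail probabilities can differ noticeably from the normal approximation.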

2 supplemental ZIPs

Thesis
Peer Reviewed

Finding Genes Related to Disease Using Statistical Learning

Goldstein, Benjamin Alan
Advisor(s): Hubbard, Alan E.

UC Berkeley Electronic Theses and Dissertations (2011)

This dissertation consists of the analyses of three separate genetic association datasets. Each represents a unique data structure with a different question of interest and therefore requires distinct approaches and methodologies. As such, the three substantive chapters (2-4) can each stand on their own. However, the over-arching question in each of these studies is the same: which genes (or genetic material) are related to the disease or outcome being studied. Moreover, while the methodologies are each distinct, they all incorporate statistical learning methodologies to obtain some modicum of inference.

Study 1 - As computational power has improved, the application of statistical learning algorithms to finding SNPs related to disease has become more common. The hope is that these algorithms will be more capable than typical marginal testing in detecting SNPs with higher order effects. The Random Forests (RF) algorithm is one such algorithm that has seen increased use with genetic data. As part of its output, RF ranks the predictor variables (SNPs) on their relative importance.

The present study represents the first application of the RF algorithm to Genome Wide Association (GWA) data and investigates how best to use the algorithm for this unique data structure. A multiple sclerosis (MS) GWA data set is used for the analysis. Results indicate the typical tuning parameter settings need to be adjusted for the high degree of sparsity in the data. Furthermore, the most meaningful results were obtained when both unimportant and overly important SNPs were removed. RF was able to replicate some previous findings using the same data. Moreover, four genes not previously associated with MS were identified.

Study 2 - In many analyses, one has data on one level but desires to draw inference on another level. For example, in genetic association studies, one observes units of DNA referred to as SNPs, but wants to determine whether genes that are comprised of SNPs are associated with disease. While there are some available approaches for addressing this issue, they usually involve making parametric assumptions and are not easily generalizable. A statistical test is proposed for testing the association of a set of variables with an outcome of interest. No assumptions are made about the functional form relating the variables to the outcome. A general function is fit using any statistical learning algorithm, with the SuperLearner algorithm suggested. The parameter of interest is the cross-validated risk and this is compared to an expected risk. A Wald test is proposed using the influence curve of the cross-validated risk to obtain the variance. It is shown both theoretically and via simulation that the test maintains appropriate type I error control and is more powerful than parametric tests under more general alternatives. The test is applied to an MS candidate gene study. Three separate analyses are performed highlighting the flexibility of the approach.
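The general shape of such a test can be sketched as follows. This is a simplified illustration, not the dissertation's procedure. It assumes squared-error loss, takes out-of-fold predictions as given, uses a caller-supplied null prediction (for example the outcome mean) in place of the expected risk, and treats the centered per-observation loss difference as the influence curve. The function name `wald_test_cv_risk` is hypothetical.

```python
import math
from statistics import mean, stdev


def wald_test_cv_risk(y, cv_pred, null_pred):
    """One-sided Wald test that the learner's cross-validated risk is
    lower than the risk of a null predictor.

    y:         observed outcomes
    cv_pred:   out-of-fold predictions from the fitted learner
    null_pred: predictions that ignore the variable set (e.g. the mean)
    """
    n = len(y)
    # Per-observation loss difference: null loss minus learner loss.
    diffs = [
        (yi - p0) ** 2 - (yi - p1) ** 2
        for yi, p1, p0 in zip(y, cv_pred, null_pred)
    ]
    estimate = mean(diffs)
    # Standard error from the sample standard deviation of the differences
    # (assumed stand-in for the influence-curve variance).
    std_err = stdev(diffs) / math.sqrt(n)
    z = estimate / std_err
    # Upper-tail normal probability: 1 - Phi(z).
    p_value = 0.5 * math.erfc(z / math.sqrt(2))
    return estimate, std_err, z, p_value


if __name__ == "__main__":
    outcomes = [1, 2, 3, 4, 5, 6]
    learner = [1, 2, 3, 4, 5, 6]      # toy out-of-fold predictions
    null = [3.5] * 6                  # outcome mean as the null prediction
    print(wald_test_cv_risk(outcomes, learner, null))
```

Because only the loss differences enter the test, any learner that yields out-of-fold predictions can be plugged in, which reflects the point above that no functional form is assumed.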

Study 3 - Secondary analyses, such as Gene Ontology and Motif analysis, have become central components of gene expression experiments, allowing researchers to derive biological understanding from the set of genes that are differentially expressed. An important statistical task is determining which genes should be passed on to such programs and how the genes should be grouped for analysis. The typical approach is to cluster the set of differentially expressed genes, and pass these clusters on to the secondary analyses. However, many expression experiments have specific hypotheses which allow one to analyze the genes and group them in a more targeted approach. To illustrate the utility of being more specific, a gene expression study of C. elegans is used where a particular outcome was observed and hoped to be explained. A general model is fit and analyzed to estimate the parameters corresponding to the specific hypothesis, leading to four natural groupings of the differentially expressed genes. These groupings lead to meaningful results in the secondary analyses that allow for the biologist to make robust hypotheses that are experimentally confirmed. It is shown that a traditional approach would not have yielded such robust findings.

Thesis
Peer Reviewed

Statistical Methods for Predicting Dengue Diagnosis using Clinical and LC-MS Data

Cotterman, Carolyn Louise
Advisor(s): Hubbard, Alan E.

UC Berkeley Electronic Theses and Dissertations (2015)

Dengue virus is the most widespread arthropod-borne virus affecting humans, with as many as 528 million infections each year. Of particular concern is the subset of cases which develop into life-threatening dengue hemorrhagic fever, and those which further progress into dengue shock syndrome. Non-invasive tools that accurately differentiate dengue and its subtypes from other viral infections early in the disease progression are vital for timely therapeutic intervention and supportive care. Unfortunately, such tools are sorely lacking. Using liquid chromatography-mass spectrometry (LC-MS), we detect tens of thousands of molecular features in serum, saliva, and urine of suspected dengue patients in Nicaragua. We then use machine-learning methods to help identify candidate small molecule biomarkers which, along with easily obtainable clinical data, predict dengue diagnosis and prognosis. Our findings should aid in developing a low-cost diagnostic tool for use in the field.

Article
Peer Reviewed

Extreme heat and its association with social disparities in the risk of spontaneous preterm birth

UCLA Previously Published Works (2022)

Background

Climate change is increasing the frequency and intensity of heatwaves. Prior studies associate high temperature with preterm birth.

Objectives

We tested the hypotheses that acute exposure to extreme heat was associated with higher risk of live spontaneous preterm birth (≥20 and <37 completed weeks), and that risks were higher among people of colour and neighbourhoods with heat-trapping landcover or concentrated racialised economic disadvantage.

Methods

We conducted a retrospective cohort study of people giving birth between 2007 and 2011 in Harris County, Texas (Houston metropolitan area) (n = 198,013). Exposures were daily ambient apparent temperature (AT_max in 5°C increments) and dry-bulb temperatures (T_max and T_min > historical [1971-2000] summertime 99th percentile) up to a week prior for each day of pregnancy. Survival analysis controlled for individual-level risk factors, secular and seasonal trends. We considered race/ethnicity, heat-trapping neighbourhood landcover and Index of Concentration at the Extremes as effect modifiers.

Results

The frequency of preterm birth was 10.3%. A quarter (26.8%) of people were exposed to AT_max ≥40°C, and 22.8% were exposed to T_max and T_min > 99th percentile while at risk. The preterm birth rate among the exposed was 8.9%. In multivariable models, the risk of preterm birth was 15% higher following extremely hot days (hazard ratio [HR] 1.15 (95% confidence interval [CI] 1.01, 1.30) for AT_max ≥40°C vs. <20°C; HR 1.15 (95% CI 1.02, 1.28) for T_max and T_min > 99th percentile). Censoring at earlier gestational ages suggested stronger associations earlier in pregnancy. The risk difference associated with extreme heat was higher in neighbourhoods of concentrated racialised economic disadvantage.

Conclusions

Ambient heat was associated with spontaneous preterm birth, with stronger associations earlier in pregnancy and in racially and economically disadvantaged neighbourhoods, suggesting climate change may worsen existing social inequities in preterm birth rates.

Article
Peer Reviewed

California's Public Safety Realignment Act and prisoner mortality.

UC Berkeley Previously Published Works (2023)

In 2011, a historic Supreme Court decision mandated that the state of California substantially reduce its prison population to alleviate overcrowding, which was deemed so severe as to preclude the provision of adequate healthcare. To comply, California passed the Public Safety Realignment Act (Assembly Bill [AB] 109), representing the largest ever court-ordered reduction of a prison population in U.S. history. AB109 was successful in reducing the state prison population; however, although the policy was precipitated by inadequate healthcare in state prisons, no studies have examined its effects on prisoner health. As other states grapple with overcrowded prisons and look to California's experience with this landmark policy, understanding how it may have impacted prisoner health is critical. We sought to evaluate the effects of AB109 on prison mortality and assess the extent to which policy-induced changes in the age distribution of prisoners may have contributed to these effects. To do so, we used prison mortality data from the Bureau of Justice Statistics and the California Deaths in Custody reporting program and prison population data from the National Corrections Reporting Program to examine changes in overall prison mortality, the age distribution of prisoners, and age-adjusted prison mortality in California relative to other states before and after the implementation of AB109. Following AB109, California prisons experienced an increase in overall mortality relative to other states that attenuated within three years. Over the same period, California experienced a greater upward shift in the age distribution of its prisoners relative to other states, suggesting that the state's increase in overall mortality may have been driven by this change in age distribution. Indeed, when accounting for this differential change in age distribution, mortality among California prisoners exhibited a greater reduction relative to other states in the third year after implementation. As other states seek to reduce their prison populations to address overcrowding, assessments of California's experience with AB109 should consider this potential improvement in age-adjusted mortality.

Thesis
Peer Reviewed

Computational Considerations for Targeted Learning

Coyle, Jeremy Robert
Advisor(s): Hubbard, Alan E

UC Berkeley Electronic Theses and Dissertations (2017)

Targeted Learning represents a principled methodology that has the potential to leverage the availability of big datasets and large scale computing facilities. However, many of the methods are computationally demanding, and therefore require careful consideration as to their implementation. This thesis comprises three case studies at the intersection between Targeted Learning and computation. Chapter 1 describes the Targeted Bootstrap, a novel bootstrap technique that samples from a TMLE distribution and therefore has asymptotic performance guarantees, while avoiding issues related to cross-validation on bootstrap samples. Chapter 2 considers the problem of estimating both a target parameter and nuisance parameter on which it depends, when ideally both would be estimated with cross-validation. By carefully considering what parts of the sample are used for what estimation tasks, nested cross-validation can be avoided at great computational savings. This is achieved using the novel SplitSequential cross-validation approach. Chapter 3 describes the opttx package for learning optimal treatment rules. This package contains an implementation of SplitSequential Super Learner, and also contains a novel approach to learning an optimal rule for a categorical treatment variable. Further, performance-based variable importance measures are used to evaluate which of the covariates are most useful for making treatment decisions.

Thesis
Peer Reviewed

Software for prediction and estimation with applications to high-dimensional genomic and epidemiologic data

Ritter, Stephan Johannes
Advisor(s): Hubbard, Alan E.

UC Berkeley Electronic Theses and Dissertations (2013)

Three add-on packages for the R statistical programming environment (R Core Team, 2013) are described, with simulations demonstrating performance gains and applications to real data. Chapter 1 describes the relaxnet package, which extends the glmnet package with relaxation (as in the relaxed lasso of Meinshausen, 2007). Chapter 2 describes the widenet package, which extends relaxnet with polynomial basis expansions. Chapter 3 describes the multiPIM package, which takes a causal inference approach to variable importance analysis. Section 3.7 describes an analysis of data from the PRospective Observational Multicenter Major Trauma Transfusion (PROMMTT) study (Rahbar et al., 2012; Hubbard et al., 2013), for which the multiPIM package is used in conjunction with the relaxnet and widenet packages to estimate variable importances.
