eScholarship
Open Access Publications from the University of California

UC Berkeley

UC Berkeley Electronic Theses and Dissertations

Interpretable and efficient statistical approaches for biomedical data

Abstract

Statistics and machine learning have achieved remarkable successes in solving data problems, including driving new biomedical discoveries. In particular, prediction and hypothesis testing are two important applications of statistics and machine learning to biomedical data. In this dissertation, we investigate how appropriate interpretation of prediction algorithms and scrutiny of the efficiency of hypothesis testing techniques can help extend the capability of statistical and machine learning approaches in biomedical science.

Chapter 1 of this dissertation provides an overview of the topics covered, as well as the background information for the rest of the dissertation. Chapters 2 and 3 introduce applications of an interpretable machine learning prediction pipeline to two biomedical problems: drug response prediction, and molecular partner prediction in clathrin-mediated endocytosis. In the drug response prediction task, our predictive and stability-driven pipeline achieves state-of-the-art performance in identifying stable, predictive -omics features for drug response. In the molecular partner prediction task, we developed an interpretable deep learning model that achieves state-of-the-art accuracy in predicting whether a clathrin-coated pit is abortive or valid.

Chapter 4 focuses on the interpretation of a specific algorithm: random forest. Random forest has seen numerous applications in the biomedical sciences, and its interpretation has become an important topic of research. We derived the first finite-sample bound on the bias of Mean Decrease Impurity (MDI), one of the most widely used measures of feature importance. To reduce this bias, we proposed a new feature importance measure, called MDI-oob. MDI-oob achieves state-of-the-art performance in feature selection from random forest in biology-inspired simulations.

Chapters 5 and 6 aim to provide a more comprehensive understanding of some of the most popular high-dimensional tests for biomedical data. Of particular interest is the comparison of special-purpose tests with the Bonferroni correction (or the closely associated max test in global testing), a simple and transparent test whose Type-I error (false positive) control is robust to arbitrary dependence between the $p$-values of the univariate null hypotheses. In the context of global testing, we showed that the max test is optimal for detecting sparse signals, provided that the distribution of the signals has Gaussian or heavier tails. We also derived the first general negative results for knockoff methods: we give realistic conditions on the covariance matrix of the design matrix under which the true positive rate of the best achievable knockoff method must converge to zero, even when the true positive rate of the Bonferroni correction converges to one.
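
For readers unfamiliar with the baseline procedures named above, the following is a minimal sketch of the textbook Bonferroni correction and of the max test written in terms of $p$-values. It is not the dissertation's code; the function names `bonferroni_reject` and `max_test_reject`, the default level `alpha = 0.05`, and the example $p$-values are all illustrative assumptions.

```python
from typing import List, Sequence


def bonferroni_reject(p_values: Sequence[float], alpha: float = 0.05) -> List[bool]:
    """Reject hypothesis i when p_i <= alpha / n (textbook Bonferroni rule).

    The union bound keeps the family-wise Type-I error at or below alpha
    regardless of how the p-values depend on one another.
    """
    n = len(p_values)
    if n == 0:
        return []
    threshold = alpha / n
    return [p <= threshold for p in p_values]


def max_test_reject(p_values: Sequence[float], alpha: float = 0.05) -> bool:
    """Global test: reject the global null when the smallest p-value
    (equivalently, the most extreme statistic) clears the Bonferroni threshold.
    """
    return any(bonferroni_reject(p_values, alpha))


# Hypothetical p-values, chosen only for illustration.
example_p_values = [0.001, 0.04, 0.2, 0.6]
decisions = bonferroni_reject(example_p_values)      # threshold is 0.05 / 4 = 0.0125
global_decision = max_test_reject(example_p_values)
```

With these made-up inputs only the first hypothesis is rejected, and the global null is rejected because at least one $p$-value falls below the corrected threshold.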
