eScholarship
Open Access Publications from the University of California

UC Berkeley

UC Berkeley Electronic Theses and Dissertations

Probabilistic Models and Statistical Tools for Gene Expression Analysis

Abstract

The recent explosion of novel experimental protocols in the life sciences is providing glimpses into fundamental biological processes that previously remained inaccessible. The mechanisms underlying these processes are unique enough that understanding them often requires approaches beyond those traditionally associated with data-driven statistics. This thesis explores three such instances, in which borrowing ideas from geometry, probability theory and partial differential equations can lead to tangible improvements over existing methods, as well as new frameworks for future tasks. Our first analysis revolves around protein synthesis: the conversion of genes into viable polypeptides through what is known as translation. By exploiting the so-called Totally Asymmetric Simple Exclusion Process (TASEP) as a model of translation, we rephrase questions of biological interest in terms of Markov chain properties, which we in turn tackle by deriving a suitable continuum limit of the TASEP. Analysis of this limiting process reveals a handful of key parameters that govern translation efficiency, whose roles we summarize in a concise set of design principles and confirm on yeast ribosome profiling data. Secondly, we direct attention to the task of gene expression deconvolution: recovering individual cell type contributions to the transcript abundances of an entire tissue. By embedding our deconvolution procedure into a full-likelihood framework, we not only provide provably optimal error guarantees, but also enable convenient model evaluation, adaptation and uncertainty quantification. We demonstrate this improved performance and flexibility on a variety of simulated and experimental bulk samples. Thirdly, motivated by detecting differential expression of genes across tissues, individuals or conditions, we investigate non-parametric two-sample testing. After identifying a broad family of statistics that includes as special cases Mann-Whitney’s U, Greenwood’s and Dixon’s, we employ combinatorial tools to quickly compute the moments of their null distributions to arbitrary precision. Combined with an equally fast and provably accurate solution to the related moment problem, we thus arrive at a well-calibrated, versatile goodness-of-fit test with applicability beyond the gene expression setting. We showcase its power in various direct comparisons with a number of tests commonly used in practice.
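
For readers unfamiliar with the simplest member of the family mentioned above, the following is a minimal sketch of the Mann-Whitney U statistic computed by direct pair counting. It is not the thesis's method for obtaining null distributions; the function name `mann_whitney_u`, the tie-handling convention (ties count 1/2) and the sample values are illustrative assumptions.

```python
from itertools import product


def mann_whitney_u(x, y):
    """Number of pairs (xi, yj) with xi < yj; ties contribute 1/2.

    Conventions differ on which sample is counted as "smaller";
    this orientation is an assumption made for illustration.
    """
    u = 0.0
    for xi, yj in product(x, y):
        if xi < yj:
            u += 1.0
        elif xi == yj:
            u += 0.5
    return u


# Hypothetical expression values for one gene under two conditions.
condition_a = [1.2, 3.4, 2.2, 0.9]
condition_b = [2.8, 4.1, 3.9]

u_ab = mann_whitney_u(condition_a, condition_b)
u_ba = mann_whitney_u(condition_b, condition_a)

# The two orientations always sum to the total number of pairs.
print(u_ab, u_ba, len(condition_a) * len(condition_b))
```

A value of U near 0 or near the total number of pairs suggests that one sample tends to lie below the other; how extreme a value must be to reject the null hypothesis depends on the null distribution of the statistic.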
