eScholarship
Open Access Publications from the University of California

UC San Diego

UC San Diego Electronic Theses and Dissertations

Statistical models for RNA biology : from single nucleotides to single cells

Abstract

With the advent of RNA sequencing and other high-throughput molecular assays, RNA biology has recently transitioned from careful curation of single-hypothesis experiments to data-driven design of multi-hypothesis investigations. Fortunately, statistical advances and increasingly powerful computers have given rise to machine learning, a computational framework that can automatically distill perpetually growing datasets into predictive models of fundamental cellular and disease processes. Finally, recent advances in microfluidics have enabled the efficient capture and interrogation of individual cells by a variety of molecular assays. My research bridges these fields by introducing predictive statistical models of RNA abundance and processing in single cells to uncover new insights into the regulation of RNA editing and splicing and their effects on cellular differentiation. This dissertation collects my contributions in single-cell and statistical genomics, from low-level details of data analysis to high-level principles of cellular identity and diversity. My early contributions concentrate on building error models of RNA sequencing data in order to extract biologically relevant signals from the experimental noise and sampling biases inherent in high-throughput sequencing technologies. Specifically, I describe statistical models of RNA splicing and editing that are robust to noise from PCR duplicates or sequencing errors and to uneven sampling from incomplete reverse transcription or cDNA fragmentation biases. I then evaluate the models' self-consistency and compare their accuracy relative to a gold standard. With a solid statistical foundation for sequencing data analysis established, my latest contributions focus on developing principled methods of constructing and evaluating compelling biological hypotheses in collaboration with domain experts.
Specifically, I describe a Bayesian model of A-to-I RNA editing whose high specificity helped resolve the functional difference between the catalytically active RNA-binding protein ADR-2 and its inactive homolog ADR-1. In another collaboration, I used machine learning to resolve a long-standing question in immunology regarding the asymmetric specification of T cells into two functionally distinct lineages. Here, through one of the first applications of single-cell gene expression analysis to the immune system, I demonstrate that pathogen-activated T cells undergo an early bifurcation into effector- and memory-fated populations and help identify the genes whose asymmetric expression drives this phenomenon. Together, these contributions establish a principled statistical framework for experimental design and analysis that integrates both hypothesis- and data-driven models to validate new findings and uncover novel principles of RNA biology.
