Statistical inference of a single cell gene expression model for heterogeneous tissues
Biological tissues are made up of many individual cells that perform different tasks in concert to carry out sophisticated and essential processes. Knowing the behavior of each cellular subpopulation in a tissue can therefore provide fundamental insight into how a tissue's normal functions are mis-regulated in disease. The recent advent of single cell RNA-sequencing has enabled us to measure the gene expression profiles (a proxy of cellular behavior) of the individual cells that constitute a heterogeneous tissue.
Single cell RNA-seq has already been used to identify new cell types in heterogeneous tissues; however, its further application is limited by the need for analyses designed specifically for this new data type. Because single cell RNA-seq samples gene expression profiles from the underlying tissue sample, the result is a distribution of global gene expression states. Analyses developed for handling individual gene expression profiles from bulk RNA-sequencing are not applicable to this new data type, nor are they capable of handling the increased measurement noise or tissue heterogeneity.
In this thesis we present new statistical analyses for characterizing gene expression distributions within tissue samples and comparing them across tissue samples. These analyses share a common approach: exploiting a natural property of transcriptional systems to reduce the complexity of the data. Through a survey of over 500 datasets, we find that global gene expression profiles can be accurately represented as a linear combination of a relatively small number of gene expression ``programs''.
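As an illustration of what representing expression profiles as a linear combination of a few programs can look like in practice, the following is a minimal sketch using a truncated singular value decomposition. The decomposition method, the synthetic data, and all names (`low_rank_fit`, `programs`, `weights`) are assumptions made for this example; the abstract does not state which decomposition the thesis uses.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (an assumption for illustration): 200 cells by 1000 genes,
# generated from 5 hidden "programs" plus a small amount of noise.
n_cells, n_genes, n_programs = 200, 1000, 5
programs = rng.normal(size=(n_programs, n_genes))   # one row per program
weights = rng.normal(size=(n_cells, n_programs))    # per-cell program usage
expression = weights @ programs + 0.1 * rng.normal(size=(n_cells, n_genes))


def low_rank_fit(X, k):
    """Approximate X by its gene-wise mean plus a linear combination of k basis vectors.

    Returns the per-cell coordinates, the k basis vectors, the gene-wise mean,
    and the fraction of variance captured by the rank-k approximation.
    """
    mean = X.mean(axis=0)
    U, s, Vt = np.linalg.svd(X - mean, full_matrices=False)
    coords = U[:, :k] * s[:k]          # low-dimensional representation of each cell
    basis = Vt[:k]                     # hypothetical stand-in for expression programs
    explained = float((s[:k] ** 2).sum() / (s ** 2).sum())
    return coords, basis, mean, explained


coords, basis, mean, explained = low_rank_fit(expression, n_programs)

# Reconstruct every profile from its 5 coordinates and measure the relative error.
reconstruction = coords @ basis + mean
rel_error = np.linalg.norm(expression - reconstruction) / np.linalg.norm(expression)
print(f"variance explained: {explained:.3f}, relative error: {rel_error:.3f}")
```

In this sketch each cell is summarized by 5 numbers instead of 1000, which is the kind of complexity reduction that makes the downstream distribution fitting tractable.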
To represent heterogeneous tissues, we infer a statistical model of the underlying distribution of single cell gene expression states in a tissue. Using low-dimensional representations, we can now use Gaussian mixture models to fit a distribution to each cellular subpopulation within a tissue. This unbiased model introduces a natural sense of distance between heterogeneous tissue samples and can be used to identify patient-specific signatures, monitor disease progression through treatment, and classify disease state from just 15 single cell transcriptomes. Throughout these applications, our model also reveals biological insights, including shifts in myeloid cell abundances, MHC I downregulation for immune evasion, and a clonally expanded tumor population.
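The idea of fitting a Gaussian mixture model to low-dimensional cell states, and then using the fitted models to classify a small batch of cells, can be sketched as follows. This is not the thesis's implementation: the synthetic tissues, the helper `simulate_tissue`, the choice of two components, and the likelihood-comparison rule for classification are all assumptions made for illustration. The sketch uses scikit-learn's `GaussianMixture`.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)


def simulate_tissue(fractions, n_cells, rng):
    """Draw synthetic 3-dimensional cell states from two well-separated subpopulations.

    Hypothetical helper: `fractions` gives the abundance of each subpopulation.
    """
    centers = np.array([[0.0, 0.0, 0.0], [6.0, 6.0, 0.0]])
    labels = rng.choice(len(centers), size=n_cells, p=fractions)
    return centers[labels] + rng.normal(size=(n_cells, 3))


# Two synthetic "tissues" that differ only in subpopulation abundances (assumption).
healthy = simulate_tissue([0.9, 0.1], 500, rng)
diseased = simulate_tissue([0.1, 0.9], 500, rng)

# Fit one mixture model per tissue; each component models one subpopulation.
healthy_model = GaussianMixture(n_components=2, random_state=0).fit(healthy)
diseased_model = GaussianMixture(n_components=2, random_state=0).fit(diseased)

# Classify a new sample of only 15 cells by comparing its average log-likelihood
# under each tissue model (score() returns the mean per-cell log-likelihood).
new_cells = simulate_tissue([0.1, 0.9], 15, rng)
ll_healthy = healthy_model.score(new_cells)
ll_diseased = diseased_model.score(new_cells)
predicted = "diseased" if ll_diseased > ll_healthy else "healthy"
print(predicted, healthy_model.weights_.round(2), diseased_model.weights_.round(2))
```

The fitted mixture weights estimate subpopulation abundances, so a shift in abundance between tissues shows up directly as a change in `weights_`.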