Methodology and Applications for Studying the Heterogeneity and Sequence Determinants of cis-Regulatory Elements
Cis-regulatory elements (CREs) are non-coding segments of the genome that regulate the transcription of nearby genes. They can be broadly divided into two categories: 1) promoters, positioned directly upstream of their target gene, and 2) enhancers, positioned distally to their target gene. Enhancers are thought to be the main drivers of cell type-specific and state-specific transcription, and regulate gene expression by fine-tuning the rate of transcription, as opposed to the more binary (on or off) regulatory function that promoters typically have. Understanding how enhancers function is therefore crucial to understanding how cells obtain and maintain certain fates and determine their response to stimuli. Despite their importance, much is still unknown about the roles enhancers play in many biological processes, and how their sequence determines their regulatory function.
The first part of this dissertation deals with single-cell chromatin accessibility data (e.g., as produced by single-cell ATAC-seq) as a means for systematically studying the heterogeneity of CREs, and specifically enhancers. In chapter 2 this is demonstrated in the innate immune system's response to vaccination: in a subset of cells, a distinct state of chromatin accessibility maintains long-term epigenetic changes that prime these cells for a different response to stimuli, and provides non-specific viral protection. Although this data modality is promising, its unique properties pose significant challenges. These are addressed in chapter 3, which introduces PeakVI, a deep generative model that provides a comprehensive statistical framework for analyzing data generated by scATAC-seq assays. Recent advances in sequencing technologies now enable obtaining these measurements alongside gene expression measurements (i.e., single-cell RNA-seq), providing the ability to directly measure the relationship between the heterogeneity of the chromatin landscape and that of the transcriptional profile. Chapter 4 introduces MultiVI, a general framework for the joint analysis of multi-modal single-cell data, using single-cell ATAC-seq and single-cell RNA-seq as the main example. These models enable exploration of cis-regulatory programs, identification of putative key enhancers, and generation of hypotheses about their regulatory functions.
The second part of this dissertation focuses on analyzing high-throughput functional data produced by massively parallel reporter assays (MPRAs). These assays enable direct functional characterization of thousands of synthetically generated candidate regulatory sequences. However, they include both DNA-seq and RNA-seq observations and require controlling for various technical confounders within both assays, posing substantial computational challenges. Chapter 5 describes MPRAnalyze, a nested generalized linear model that provides a comprehensive statistical framework for analyzing MPRA data. Chapter 6 then uses MPRAnalyze extensively to identify key enhancers and novel transcription factors involved in early neural differentiation. In chapter 7, systematic perturbation of binding sites in the identified enhancers reveals the specific sequence features that determine enhancer function, and elucidates how multiple functional sites interact in a single enhancer sequence to reach the desired functional output.