Single-cell sequencing has emerged as a powerful tool for dissecting cellular heterogeneity and providing cell type-specific biological insights. Single-cell sequencing technologies have rapidly proliferated over the last decade, leading to an explosion of data generated from such experiments. However, the size and complexity of these data pose several challenges for computational analysis, including the need for sophisticated statistical methods to distinguish biologically meaningful signals from noise, the integration of single-cell sequencing data with other types of biological information, and the development of scalable and reproducible computational pipelines. In this dissertation, I present two distinct projects analyzing single-cell sequencing data. The first is of an analytical nature and tackles a translational question. In this project, I built computational pipelines for processing and analyzing single-nucleus RNA- and ATAC-sequencing datasets generated from the amygdalae of genetically diverse heterogeneous stock rats, which were subjected to a behavioral protocol for studying addiction-like behaviors following cocaine self-administration. In doing so, I provide a standard reference for analyzing such data and reveal cell type-specific insights into the molecular underpinnings of cocaine addiction. The second project is oriented towards methods development and seeks to understand the fundamental biological question of transcriptional regulation. Here, I developed a statistical framework for simulating and modeling data from single-cell CRISPR regulatory screens and used it to perform a genome-wide interrogation of epistatic-like interactions between enhancer pairs. I found that multiple enhancers act together in a multiplicative fashion with little evidence for interactive effects between them.
This work reveals novel insights into the collective behavior of multiple regulatory elements and provides a tool that can be applied to future datasets generated from such experiments. This dissertation exemplifies how computational methods can be applied in different contexts to extract meaning from a variety of single-cell sequencing modalities. By tackling both a translational and a fundamental biological question, I have showcased the breadth of what can be revealed by studying single-cell sequencing data and the computational methods necessary to extract this information.
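The finding that enhancers combine multiplicatively with little interaction can be illustrated with a small sketch. On the log scale, a multiplicative model is additive, so an epistatic-like effect would appear as a nonzero coefficient on the product of the two perturbation indicators. The code below is an illustration only, not the dissertation's framework: the function name, the ordinary least-squares fit, and the noise-free toy expression values are all assumptions made for this example.

```python
import numpy as np

def fit_log_linear_interaction(x1, x2, expression):
    """Fit log(expression) = b0 + b1*x1 + b2*x2 + b12*x1*x2 by least squares.

    x1, x2: 0/1 indicators for perturbation of enhancer 1 and enhancer 2
    (hypothetical encoding). A purely multiplicative model gives b12 = 0.
    """
    x1 = np.asarray(x1, dtype=float)
    x2 = np.asarray(x2, dtype=float)
    log_y = np.log(np.asarray(expression, dtype=float))
    # Design matrix: intercept, two main effects, and their interaction.
    X = np.column_stack([np.ones_like(x1), x1, x2, x1 * x2])
    coef, *_ = np.linalg.lstsq(X, log_y, rcond=None)
    names = ["baseline", "enhancer_1", "enhancer_2", "interaction"]
    return dict(zip(names, coef))

# Toy, noise-free data (assumed values): perturbing enhancer 1 halves
# expression, perturbing enhancer 2 scales it by 0.8, and perturbing both
# scales it by 0.5 * 0.8 = 0.4, i.e. the effects multiply.
x1 = np.array([0, 1, 0, 1])
x2 = np.array([0, 0, 1, 1])
expression = np.array([100.0, 50.0, 80.0, 40.0])

fit = fit_log_linear_interaction(x1, x2, expression)
```

In this toy case the fitted interaction coefficient is zero up to floating-point error, and exponentiating each main effect recovers the fold change of the corresponding perturbation. Real single-cell count data would call for a count-based model rather than least squares on log values.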
When mapping expression quantitative trait loci, a linear additive genetic model is most commonly used to investigate how genetic variants influence transcript levels. This model assumes that the phenotype of heterozygotes is halfway between that of the low-homozygous and high-homozygous genotypes and may miss non-additive relationships, such as those caused by dominant alleles. Here we examine RNA-Seq data to identify dominant genetic associations with gene expression in the human genome. We applied a multiple linear regression model to genotypes and RNA-Seq data from the Genotype-Tissue Expression project. With stringent permutations, we discovered that on average, 0.19% of all genes tested (including non-coding RNAs and pseudogenes) show evidence for dominant genetic associations across ten different tissues. Most dominant effect sizes are positive, implying that heterozygotes tend to have gene expression levels similar to those of high-expression homozygotes. In 8 out of the 10 tissues we examined, we found that genes encoding major histocompatibility complex (MHC) proteins are enriched for dominant effects.
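The contrast between the additive model and a model that allows dominance can be sketched as follows. The abstract does not spell out the regression design, so the encoding here is an assumption: genotype dosage (0, 1, 2) carries the additive effect, and a separate heterozygote indicator carries the dominance deviation, that is, how far heterozygotes sit from the midpoint of the two homozygotes. The function name and the toy values are hypothetical.

```python
import numpy as np

def fit_additive_dominance(genotypes, expression):
    """Fit expression = b0 + b_add*dosage + b_dom*is_het by least squares.

    genotypes: allele dosage coded 0, 1, 2 (assumed encoding).
    b_dom is the deviation of heterozygotes from the homozygote midpoint;
    under a purely additive model it is zero.
    """
    g = np.asarray(genotypes, dtype=float)
    y = np.asarray(expression, dtype=float)
    is_het = (g == 1).astype(float)  # 1 for heterozygotes, 0 otherwise
    # Design matrix: intercept, additive dosage, heterozygote indicator.
    X = np.column_stack([np.ones_like(g), g, is_het])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return {"intercept": coef[0], "additive": coef[1], "dominance": coef[2]}

# Toy, noise-free data (assumed values): homozygotes average 1.0 and 3.0,
# so the additive midpoint is 2.0, but heterozygotes sit at 2.8, close to
# the high-expression homozygote.
genotypes = np.array([0, 0, 1, 1, 2, 2])
expression = np.array([1.0, 1.0, 2.8, 2.8, 3.0, 3.0])

fit = fit_additive_dominance(genotypes, expression)
```

Here the dominance coefficient is positive (heterozygotes lie 0.8 above the midpoint), which corresponds to the pattern described above, where heterozygotes resemble high-expression homozygotes. In practice, significance would be assessed separately, for example with the permutations mentioned in the abstract.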
Precise regulation of gene expression is crucial for organismal development. However, knowledge of regulatory genomic sequences (functional sequences), their targets, and modes of activation remains limited. Recently, tiling CRISPR screens have been developed for the unbiased interrogation of the genome within its native context. These screens leverage the CRISPR-Cas9 system to perturb putative functional sequences and examine their effects on gene expression. This approach makes it possible to identify functional sequences as well as their target genes. In this dissertation I will highlight the aspects of tiling CRISPR screens that make them both attractive to use and difficult to analyze, and present the different analytical approaches to date. Notably, I will describe our method RELICS, which models several key components of tiling CRISPR screens to accurately identify functional sequences. In the first chapter I describe a simulation tool, CRSsim, which I developed to systematically evaluate different analysis methods for CRISPR screens against one another. This chapter highlights the importance of simulations and shows how I statistically recreated the generative process of data from CRISPR screens to simulate realistic data sets for benchmarking. In the second chapter I present RELICS, a method developed specifically for identifying functional sequences from tiling CRISPR screens. I will describe how RELICS models the data and demonstrate that it outperforms all other methods that are currently used for analyzing tiling CRISPR screens. Finally, I will present the results of RELICS applied to different experimental datasets, including publicly available datasets as well as data from our in-house GATA3 tiling deletion screen. Importantly, we discovered and validated novel functional sequences that were not detected by competing methods.
Some of these sequences do not exhibit canonical epigenetic marks of regulatory elements, highlighting the importance of tiling CRISPR screens as an unbiased approach for detecting functional sequences and illuminating the regulatory landscape.