Advances in high-throughput omics technologies allow an increasing compendium of molecular layers to be assayed, from genome and epigenome profiling to transcriptomics and proteomics. Such data provide detailed snapshots that characterize the molecular state of a given biological system at very fine resolution. Single cell genomics assays such as scRNA-seq and scATAC-seq capture the landscape of genomic features across large collections of cells and have become some of the most popular molecular profiling techniques for investigating diverse problems related to gene regulation, such as the identification of novel cell types and their regulatory signatures, trajectory inference for the analysis of continuous processes such as differentiation, high-resolution analysis of transcriptional dynamics, and characterization of transcriptional heterogeneity within populations of cells.
Despite rapidly evolving technologies that can scale up to millions of cells across multiple individuals, one of the most pressing challenges in single cell genomics analysis is addressing technical noise, which can drive approximately 50% of the cell-to-cell variation in expression measurements. Such technical noise is often associated with high sparsity of the genomic feature measurements. In chapter 2, we focus on alleviating the effect of this technical variation in feature measurements of single cell genomics data, such as gene expression and locus accessibility. We show that technical variation in both scRNA-seq and scATAC-seq datasets can be mitigated by analyzing feature detection patterns alone and ignoring feature quantification measurements. This result holds when datasets have low detection noise relative to quantification noise. We demonstrate state-of-the-art performance of detection pattern models using our new framework, scBFA, for both cell type identification and trajectory inference.
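To make the notion of a feature detection pattern concrete, the following is a minimal sketch, assuming a cells-by-features count matrix and a simple "count greater than zero" rule for detection. The function name `detection_pattern`, the `threshold` parameter, and the toy matrix are illustrative assumptions; this is not the scBFA implementation, only the binarization step that detection-based analysis starts from.

```python
import numpy as np

def detection_pattern(counts, threshold=0):
    """Binarize a cells x features count matrix (hypothetical helper).

    A feature is marked as detected (1) in a cell when its count exceeds
    `threshold`; the quantification (how large the count is) is discarded.
    """
    counts = np.asarray(counts)
    return (counts > threshold).astype(np.uint8)

# Toy count matrix: 3 cells (rows) x 3 features (columns), made up for illustration.
counts = np.array([
    [0, 3, 12],
    [1, 0, 0],
    [0, 0, 7],
])

detected = detection_pattern(counts)

# Fraction of cells in which each feature is detected.
detection_rate = detected.mean(axis=0)
```

A downstream model would then operate on `detected` rather than on `counts`, so that differences in count magnitude between cells no longer contribute to the analysis.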
While single cell genomics assays are inherently high dimensional, the variation among individual cells is often summarized in a low-dimensional space reflecting changes in genes' mean expression. Gene co-expression networks, which are often inferred from RNA sequencing data, offer another perspective for studying cell type-specific functional modules and complex regulatory interactions from transcriptomic profiles. The increasing availability of large-scale scRNA-seq datasets is now making it possible to infer many gene networks from diverse cell populations. However, there are currently no mature tools for visualizing and comparing large collections of networks across single cell populations, or for identifying correlations between variation in gene network structure and cell population-level phenotypes. In chapter 3, we present an unsupervised framework, scMultiAE, that enables comparison and visualization of multiple gene networks in a low-dimensional space, with a focus on studying the heterogeneity of iPSCs during differentiation.