Recognizing Cell Identity: Classifying cell types in scRNAseq data
The classification of cell types is one of the first steps in scRNAseq analysis for translating observed transcriptional variation into biological insight. The same cell types can be sampled from different environments and with different technologies, and their transcriptional profiles can differ as a result. Thus, defining cell types in scRNAseq data is much more than a matter of identifying clusters of cells that are similar to each other. In Chapter 1, we developed a simulation method, SymSim, to understand the different facets of variability in scRNAseq data. In Chapter 2, we applied a Bayesian variational inference method, scVI, to the harmonization of scRNAseq datasets and proposed a new method, scANVI, within the same framework for the annotation of these datasets. We tested the performance of scVI and scANVI using both SymSim simulations and experimental data. In Chapter 3, we applied our data harmonization method scVI to a Multiple Sclerosis (MS) case-control study that used scRNAseq to profile immune cells. After identifying cell types shared between blood and CSF across multiple donors, we identified MS-associated changes in tissue-specific cell type abundance and in transcription. In Chapter 4, we applied a number of scRNAseq harmonization and annotation methods, including scVI and scANVI, to Tabula Sapiens, a large consortium cell atlas project that aims to provide a comprehensive reference scRNAseq dataset for the scientific community. We developed an automatic annotation pipeline named PopularVote to facilitate the in-house data annotation process and to be released as a public tool that other scientists can use to annotate their own data. This dissertation presents a set of tools that we developed or used for cell type annotation in a diverse set of scRNAseq applications (identifying rare cell types, comparing cell types across conditions, generating automatic data annotations).
The potential of scRNAseq is best realized by generating a well-curated dataset that everyone in the research community can use and contribute to, and the ability to classify cells automatically will enable such efforts in the future.