Over the last two decades, technological improvements have led to a tremendous reduction in the cost of DNA sequencing and a corresponding increase in its speed. This has opened the door to many new applications, including the quantification of transcriptomes at the resolution of single cells (scRNA-Seq) and the discovery of genetic features associated with phenotypes of interest, known as genome-wide association studies (GWAS).
scRNA-Seq has emerged in just 10 years as a major tool to investigate biological diversity. The ability to assess gene expression for individual cells enables the study of a range of biological processes at new levels of resolution. This has many interesting applications, including the identification of novel cell types, defined by their unique transcriptomic signatures. Drawing from the clustering literature, many methods have been developed to group cells together based on their gene expression characteristics. However, all of these algorithms require tuning hyper-parameters, typically based on ad hoc recommendations. Moreover, the direct validation of the discovered cell types is generally difficult, if not impossible. The grouping of cells is also not unique for a given biological system, and there often exists a hierarchy of cell types, with ever-finer levels of resolution. For all these reasons, the discovery of reliable, replicable cell types remains a major challenge. In Chapter 2, we will delve deeper into this issue and introduce a new method called Dune that tackles the resolution-replicability trade-off in clustering.
scRNA-Seq data also enable the tracking of continuous developmental changes, without the need for an arbitrary discretization that stems purely from the data collection protocol. This makes it possible to investigate processes such as the cell cycle, the differentiation of stem cells into different cell types, or the cellular response to a drug over time. In Chapter 3, we will investigate how to characterize patterns of gene expression along such developmental trajectories, to identify dynamic genes and drivers of differentiation, using the tradeSeq method. In Chapter 4, we will provide a general workflow called condiments for analyzing such dynamic systems in the presence of multiple conditions, such as treatment/control.
GWAS represent another field that has gained major attention following the emergence of cheaper high-throughput sequencing technologies. In human populations, the problem has been extensively studied, mainly in the context of diseases such as diabetes. However, GWAS can also be applied to bacterial genomes, especially in the context of antibiotic resistance. Some concepts from the human GWAS literature are applicable in bacteria. However, the characteristics of bacterial genomes mean that other concepts, such as that of a reference genome, are inappropriate and irrelevant. New methods therefore need to be developed for this specific problem. In Chapter 5, we present a new subgraph enumeration method named CALDERA that leverages the structure of the data to provide more robust analyses and facilitate the interpretation of bacterial GWAS data.