Statistical Analysis and Integration of Multi-Modal Sequencing Data
- Gong, Boying
- Advisor(s): Purdom, Elizabeth
Abstract
The explosion of high-throughput sequencing technologies in the past decade has enabled the measurement of various molecules and biological processes at both the bulk level and the single-cell level, resulting in data modalities such as gene expression, methylation, and chromatin accessibility. Each of these modalities has distinct data properties and therefore poses different statistical challenges. It is important to study these modalities both separately and jointly in order to understand their individual characteristics and their complex interactions. This dissertation addresses both questions. For the modeling of a single modality, we first introduce a method for the differential analysis of bulk methylation sequencing data. We propose an approach based on change-point detection that identifies regions in the genome that exhibit different methylation levels between groups or over time. We show that our approach gives improved performance compared to existing methods. We then provide a case study on single-cell gene expression data, illustrating the typical workflow and addressing the challenges involved in analyzing such datasets. For the integration of multiple modalities, we introduce a novel method that focuses on a question that has not yet been addressed: the joint modeling of single-cell single-modality platforms with multi-modality platforms. We apply the method to integrate a single-cell gene expression dataset, a single-cell chromatin accessibility dataset, and a single-cell joint-platform dataset that sequences both gene expression and chromatin accessibility.