Leveraging replicable sources of variability to increase power and interpretability in analyses of genomic datasets
- Author(s): Thompson, Michael
- Advisor(s): Halperin, Eran
- et al.
Many types of genomic datasets, including RNA sequencing (RNAseq) and DNA methylation, are influenced by numerous sources of variability. Analyses of such variability frequently focus on local genetic effects and overlook the components of variability related to context-level, individual-level, or environmental effects. Here, we leverage the idea that sources of variability are often conserved across genomic datasets to propose two approaches to partitioning variability: first into distinct biological and technical components, and second into orthogonal context-specific and context-shared genetic components. Using our methods, we perform more powerful and interpretable genomic association studies (such as transcriptome- or epigenome-wide association studies), and we find that heritability is more context-specific at the level of single-cell RNAseq, whereas it is more context-shared at the level of bulk (tissue) RNAseq. Subsequently, we perform an analysis of medical records to assess how informative multiple genomic data types are for phenotype imputation tasks. We show that risk scores derived from an individual's methylation are more informative in these tasks than risk scores derived from their genotypes. The work presented here has implications for the design of multiple classes of genomic association studies, as well as for studies of the utility of genomic biomarkers in electronic medical records.