The massive generation of genetic, epigenetic, transcriptomic, and other types of data allows us to pursue biological questions at scale while adding a systems-level context to hypotheses in biology. Questions about gene expression have driven us to understand various chromatin components; most recently, this has led to the study of chromatin conformation via high-throughput methods such as Hi-C and HiChIP. To obtain a full understanding of chromatin conformation, integration with genetic variants (e.g. SNPs from GWAS and eQTL studies) and epigenetic signals (e.g. histone acetylation, open chromatin regions, and transcription factor binding) is essential. Similarly, complex diseases such as cancer can advance via a system of distinct factors that interact to form a deliberate and potent pathogenic regulatory network. Thus, it is imperative that we build the resources and tools necessary to integrate multiomics signals.
Here, I present three chapters derived from two major works that demonstrate the importance of data integration for a holistic understanding of biology. First, I present a database of HiChIP data for over 1000 samples (chapter 1) with important applications to motif analysis, GWAS and eQTL studies, and network analysis (chapter 2). Second, I showcase and describe the nipalsMCIA R package, which reduces the dimensionality of multiomics datasets for a systems-level analysis (chapter 3).
Many histone marks, obtained through chromatin immunoprecipitation followed by massively parallel DNA sequencing (ChIP-seq), are used as input features to complex machine learning frameworks for the gene expression prediction task. However, a ChIP-seq assay requires a large number of viable cells with intact nuclei, which is a limitation when viable cells are not available and the only source of cellular material is DNA, or when cells are subjected to processes that compromise their viability, such as formalin fixation and paraffin embedding. 5-hydroxymethylcytosine (5hmC) is a stable covalent DNA modification deposited by the Ten-Eleven Translocation (TET) proteins that is extensively associated with highly expressed genes and lineage-specific enhancers. Because it is a modification of the DNA itself, 5hmC can be assessed and quantified as long as some DNA is present in a sample. Through the integration of multi-omic data, we report a close correspondence between 5hmC-marked regions, chromatin accessibility, and enhancer activity in B cells. We then developed generalizable machine learning methods to predict gene expression in multiple cell types using 5hmC as a standalone epigenetic feature. Finally, through the integration of 3D genomic structure data, 5hmC signal, and complex machine learning frameworks, we predicted gene expression and cell-type-specific enhancer-promoter linkages. We identified regions that have been orthogonally validated as enhancers in the literature or that have epigenetic characteristics seen in TET-responsive regulatory elements. The analyses we conducted here highlight the potential of 5hmC signal to predict gene expression and link enhancers to their target genes, and suggest additional approaches for the study of gene regulatory networks.