Analytic Methods for Next-Generation Sequencing Studies of Chromatin Structure and 3D Organization
- Capurso, Daniel
- Advisor(s): Segal, Mark R
Abstract
Beyond linear sequence, higher-order structure of the genome influences gene regulation and has been implicated in disease. Chromatin structure is the degree of chromatin compaction at genomic loci. Chromatin organization is the spatial, three-dimensional (3D) positioning of chromatin. Here, we adapt and apply methods for next-generation sequencing analyses of chromatin structure and organization based on chromatin immunoprecipitation-sequencing (ChIP-seq) and genome-wide chromosome conformation capture (Hi-C), respectively. First, we built on a previous study that sought to classify nucleosomes containing either H2A.Z or H2A/H4 arginine 3 symmetric dimethylation (H2A/H4R3me2s) from human ChIP-seq data. We hypothesized that appropriate data preprocessing (deduplication, normalization for sequencing depth, and position-finding) in conjunction with advanced algorithms for feature selection (Discriminatory Motif Feature Selection) and classification (Random Forest) would improve performance. We achieved dramatically improved classification accuracy and identified a significant and biologically meaningful DNA motif associated with H2A/H4R3me2s: “TCCATT”, which is part of the consensus sequence of satellite II and III DNA. Second, we tested our hypothesis that there are advantages to assessing the 3D co-localization of functional annotations (e.g., centromeres) using 3D genome reconstructions from Hi-C contact data: reconstructions enable detection of multi-level interactions, whereas assessments using contact data are inherently limited to detecting strictly pairwise interactions. We found significant 3D co-localization of sets of genes with developmentally regulated expression in Plasmodium falciparum with 3D reconstruction-based assessment but not with contact-based assessment. Further, we developed a method for 3D reconstruction-based assessment that avoids the data dichotomization of previous approaches.
Third, we tested our hypothesis that analyzing ChIP-seq data in combination with 3D reconstructions could identify functional 3D hotspots. We separately overlaid a Saccharomyces cerevisiae 3D genome reconstruction with three ChIP-seq inputs and contrasted two algorithms for identifying regions in 3-space (3D hotspots) for which mean ChIP-seq peak height is significantly elevated: k-Nearest Neighbor (k-NN) regression and the Patient Rule Induction Method (PRIM). For each ChIP-seq input, both algorithms identified significant, corresponding, and biologically meaningful 3D hotspots containing distal genomic regions. Our research demonstrates that applying appropriate data preprocessing and advanced supervised learning algorithms improves the interpretability of next-generation sequencing studies of chromatin structure and organization.
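To make the k-NN hotspot idea concrete, the following is a minimal sketch, not the dissertation's implementation. It assumes each locus has a 3D coordinate from a reconstruction and a ChIP-seq peak height, fits k-NN regression by averaging peak height over each locus's k nearest loci in 3-space, and judges the largest fitted value with a simple permutation test. The function names, the toy coordinates and signal values, and the choice of permutation test are all illustrative assumptions.

```python
import math
import random


def knn_indices(coords, k):
    """Return, for every point, the indices of its k nearest points (itself included)."""
    neighbors = []
    for p in coords:
        order = sorted(range(len(coords)), key=lambda j: math.dist(p, coords[j]))
        neighbors.append(order[:k])
    return neighbors


def knn_fit(neighbors, signal):
    """k-NN regression fit: mean signal over each point's neighborhood."""
    return [sum(signal[j] for j in nbrs) / len(nbrs) for nbrs in neighbors]


def hotspot_test(coords, signal, k, n_perm=200, seed=0):
    """Largest fitted value plus a permutation p-value for it (an assumed test)."""
    rng = random.Random(seed)
    neighbors = knn_indices(coords, k)
    fitted = knn_fit(neighbors, signal)
    observed = max(fitted)
    shuffled = list(signal)
    exceed = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # break the link between 3D position and signal
        if max(knn_fit(neighbors, shuffled)) >= observed:
            exceed += 1
    return fitted, observed, (exceed + 1) / (n_perm + 1)


# Toy data (made up): five nearby loci with high signal, twenty distant loci with low signal.
cluster = [(0.0, 0.0, 0.0), (0.1, 0.0, 0.0), (0.0, 0.1, 0.0), (0.0, 0.0, 0.1), (0.1, 0.1, 0.0)]
background = [(5.0 + i, 5.0, 5.0) for i in range(20)]
coords = cluster + background
signal = [10.0] * 5 + [1.0] * 20

fitted, observed, p_value = hotspot_test(coords, signal, k=5)
hotspot_loci = [i for i, value in enumerate(fitted) if value == observed]
print(hotspot_loci, observed, p_value)
```

In this toy setup the loci with the highest fitted value are the five clustered points. With real data the neighborhood size k would need tuning, and the all-pairs distance sort used here would likely be replaced by a spatial index.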