A cross-organism framework for supervised enhancer prediction with epigenetic pattern recognition and targeted validation
- Sethi, Anurag;
- Gu, Mengting;
- Gumusgoz, Emrah;
- Chan, Landon;
- Yan, Koon-Kiu;
- Rozowsky, Joel;
- Barozzi, Iros;
- Afzal, Veena;
- Akiyama, Jennifer;
- Plajzer-Frick, Ingrid;
- Yan, Chengfei;
- Pickle, Catherine;
- Kato, Momoe;
- Garvin, Tyler;
- Pham, Quan;
- Harrington, Anne;
- Mannion, Brandon;
- Lee, Elizabeth;
- Fukuda-Yuzawa, Yoko;
- Visel, Axel;
- Dickel, Diane E;
- Yip, Kevin;
- Sutton, Richard;
- Pennacchio, Len A;
- Gerstein, Mark;
- et al.
Published Web Location
https://www.biorxiv.org/content/10.1101/385237v1.full
Abstract
Enhancers are important noncoding elements, but they have traditionally been hard to characterize experimentally. Only a few mammalian enhancers have been validated, making it difficult to properly train statistical models for their identification. Instead, postulated patterns of genomic features have been used heuristically for identification. The development of massively parallel assays allows for the characterization of large numbers of enhancers for the first time. Here, we developed a framework that uses Drosophila STARR-seq data to create shape-matching filters based on enhancer-associated meta-profiles of epigenetic features. We combined these features with supervised machine learning algorithms (e.g., support vector machines) to predict enhancers. We demonstrated that our model could be applied to predict enhancers in mammalian species (i.e., mouse and human). We comprehensively validated the predictions using a combination of in vivo and in vitro approaches, involving transgenic assays in mouse and transduction-based reporter assays in human cell lines. Overall, the validations involved 153 enhancers in 6 mouse tissues and 4 human cell lines. The results confirmed that our model can accurately predict enhancers in different species without re-parameterization. Finally, we examined the transcription-factor binding patterns at predicted enhancers and promoters in human cell lines. We demonstrated that these patterns enable the construction of a secondary model effectively discriminating between enhancers and promoters.
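As a rough illustration of the shape-matching idea mentioned in the abstract, the sketch below averages aligned signal windows into a meta-profile and slides that template along a signal track, scoring each window by normalized correlation. This is a minimal sketch under stated assumptions, not the authors' implementation: the function names `meta_profile` and `matched_filter_scores`, the use of Pearson-style normalized correlation as the matching score, and the toy numbers are all hypothetical. In the framework described above, scores of this kind from several epigenetic features would then be passed to a supervised classifier such as a support vector machine, which is omitted here.

```python
import numpy as np


def meta_profile(windows):
    """Average aligned signal windows (one per row) into a single template.

    Hypothetical helper: assumes each row is an epigenetic signal window
    centered on a known enhancer, all of equal length.
    """
    return np.asarray(windows, dtype=float).mean(axis=0)


def matched_filter_scores(track, template):
    """Score every window of `track` against `template`.

    The score is the normalized correlation between the mean-centered window
    and the mean-centered template (an assumed choice of matching score).
    Flat windows, which have zero variance, receive a score of 0.
    """
    track = np.asarray(track, dtype=float)
    template = np.asarray(template, dtype=float)
    width = len(template)
    centered_template = template - template.mean()
    template_norm = np.linalg.norm(centered_template)

    scores = np.zeros(len(track) - width + 1)
    for start in range(len(scores)):
        window = track[start:start + width]
        centered_window = window - window.mean()
        denom = np.linalg.norm(centered_window) * template_norm
        if denom > 0:
            scores[start] = centered_window @ centered_template / denom
    return scores


# Toy data (invented for illustration): two double-peak training windows
# and a flat track with one double-peak region starting at index 10.
training_windows = [
    [1.0, 4.0, 0.5, 4.0, 1.0],
    [1.2, 3.8, 0.6, 4.1, 0.9],
]
template = meta_profile(training_windows)

track = np.zeros(30)
track[10:15] = [1.0, 4.0, 0.5, 4.0, 1.0]

scores = matched_filter_scores(track, template)
best_start = int(np.argmax(scores))
print("best-matching window starts at index", best_start)
```

Normalizing by both the window and the template makes the score depend on the shape of the signal rather than its absolute magnitude, which is one plausible reason a shape-based feature could transfer between datasets with different signal scales.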