Machine learning approaches for relating genomic sequence to enhancer activity and function
- Author(s): Tao, Jenhan
- Advisor(s): Glass, Christopher K; Benner, Christopher; et al.
Despite the advent of high-throughput genomics technology and the wealth of data characterizing transcription that followed, it remains difficult to relate genomic sequence to transcriptional activity. Next-generation sequencing techniques, including ChIP-seq, RNA-seq, and ATAC-seq, have enabled high-resolution mapping of transcriptional activity, including RNA expression and histone modifications, as well as the localization of transcription factors and other DNA-binding proteins that regulate transcription. Integration of these activity maps using statistical methods and high-performance computing has produced a model in which transcription factors recognize and bind to short DNA sequence motifs ("words") to recruit cellular machinery, such as RNA polymerase, that is necessary for transcription. Previous studies have also demonstrated that transcription factors often bind together in a cell-type- and context-specific manner, setting the foundation for a genomic grammar in which combinations of transcription factors recognize "sentences" that specify cell-type- and context-specific transcriptional activity. Using this foundational model as our starting point, we devised a machine learning framework named TBA (a Transcription factor Binding Analysis) for investigating the sequence specificity of transcription factors by jointly weighing the contributions of hundreds of DNA motifs. We applied TBA to a systematic map of the binding profiles for the AP-1 transcription factor family, whose members share a conserved DNA-binding domain. We observed that each family member showed interactions with distinct sets of motifs, which varied from cell type to cell type and across cellular states.
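The idea of jointly weighing many motifs can be illustrated with a small sketch. The abstract does not specify TBA's model, so the example below assumes a plain logistic regression over per-locus motif match scores; the function names, the toy data, and the three-motif setup are all hypothetical and are not taken from TBA itself.

```python
import math
import random

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def fit_motif_weights(X, y, lr=0.5, epochs=500):
    """Fit one weight per motif by batch gradient descent (assumed model).

    X: list of loci, each a list of motif match scores.
    y: list of 0/1 labels (1 = bound locus, 0 = background).
    """
    n, m = len(X), len(X[0])
    w, b = [0.0] * m, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * m, 0.0
        for xi, yi in zip(X, y):
            p = sigmoid(b + sum(wj * xj for wj, xj in zip(w, xi)))
            err = p - yi
            grad_b += err
            for j in range(m):
                grad_w[j] += err * xi[j]
        b -= lr * grad_b / n
        w = [wj - lr * gj / n for wj, gj in zip(w, grad_w)]
    return w, b

# Hypothetical toy data: motifs 0 and 1 jointly determine binding,
# motif 2 is uninformative noise.
random.seed(0)
X = [[random.random() for _ in range(3)] for _ in range(200)]
y = [1 if x[0] + x[1] > 1.0 else 0 for x in X]

weights, bias = fit_motif_weights(X, y)
print(weights)
```

Because all motif scores enter the same model, each weight reflects a motif's contribution in the presence of the others, which is the sense of "jointly" used here.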
Next, we applied the TBA framework to hundreds of transcription factor ChIP-seq data sets, demonstrating that, like AP-1, transcription factors generally interact with dozens of other transcription factors genome-wide and with 3-4 transcription factors at a given locus in a cell-type-specific manner. We used these findings on transcription factor behavior to devise a neural network with an attention mechanism that calculates locus-specific maps of how motifs interact to predict transcriptional activity. Together, these studies demonstrate machine learning approaches that provide additional insight into the transcriptional grammar that coordinates eukaryotic gene expression.
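To give a sense of how an attention mechanism can yield a locus-specific map of motif interactions, here is a minimal sketch of scaled dot-product attention over motif vectors at a single locus. It is not the network described above: the motif vectors are hypothetical placeholders, and using the same vectors as both queries and keys is an assumption made for brevity.

```python
import math

def softmax(row):
    # Subtract the maximum for numerical stability.
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention_map(queries, keys):
    """Return a matrix whose entry [i][j] is how strongly motif i attends to motif j."""
    scale = math.sqrt(len(keys[0]))
    return [
        softmax([sum(q * k for q, k in zip(qi, kj)) / scale for kj in keys])
        for qi in queries
    ]

# Hypothetical 2-dimensional representations of three motif hits at one locus.
motif_vectors = [
    [1.0, 0.0],  # placeholder motif A
    [0.9, 0.1],  # placeholder motif B, similar to A
    [0.0, 1.0],  # placeholder motif C, dissimilar to A
]

interaction_map = attention_map(motif_vectors, motif_vectors)
for row in interaction_map:
    print([round(a, 3) for a in row])
```

Each row sums to 1, so the matrix can be read as a per-locus distribution of attention from one motif to the others. In a trained network the queries and keys would be learned projections rather than fixed vectors.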