In the human genome, the vast majority of DNA is non-coding. Although non-coding DNA does not directly encode protein sequences, it is vital to the transcriptional regulation of protein-coding genes. Recent genome-wide association studies (GWAS) have shown that ~93% of genetic variants driving common human traits and diseases lie within non-coding sequences. However, because these non-coding variants act in complicated and indirect ways, it is difficult for traditional analysis methods to sift through the large number of non-coding sequences and pinpoint the variants causal to human diseases and traits.
In this dissertation, I present AgentBind, a deep learning framework that identifies and interprets the sequence features most predictive of regulatory activities, such as transcription factor binding, histone modification, and chromatin accessibility. I demonstrate that AgentBind is applicable to diverse biological tasks, including (1) pinpointing the sequence features most important for transcription factor binding; (2) prioritizing genetic variants in transcriptional enhancers associated with human brain disorders; and (3) identifying the dominant combinations of lineage-determining and signal-dependent transcription factors driving enhancer activation in mice. Collectively, these studies provide a deep learning framework, together with use cases, for decoding the rules within non-coding regulatory regions and identifying the specific non-coding nucleotides with the strongest effects on human traits and diseases.