Score-based Transcription Factor Motif Enrichment Strategies For Analysis of Transcriptional Regulation Of Immune Response
Transcription factors (TFs) mediate transcriptional responses, allowing cells to respond to changing internal or external stimuli, including infection. The DNA sequences bound by TFs in regulatory regions are called motifs. To study transcriptional regulation, researchers employ Motif Enrichment Analysis (MEA) methods, which analyze DNA sequences from regulatory regions and determine the statistical overrepresentation of motifs in those sequences, allowing inference of the TFs relevant to that set of regulatory regions. However, most MEA tools oversimplify one or more pertinent axes of the data, obscuring potential insights into transcriptional regulation. For example, most MEA methods require thresholding of DNA sequences into two sets in order to determine motif enrichment, thus oversimplifying the underlying biological scores that determine those sets, which can prevent biological discovery. We introduce Motif Enrichment In Ranked Lists of Peaks (MEIRLOP), a score-based MEA method that allows researchers to determine the enrichment of motifs within a dataset of scored regulatory region DNA sequences. MEIRLOP uniquely utilizes a logistic regression model that also accounts for lower-order sequence bias and other covariates. We demonstrate its utility on multiple ChIP-seq datasets, where it proves more capable than other methods of finding the enrichment of key transcription factor binding motifs, including the enrichment of binding sites key to immune response. An overlooked axis in most MEA methods is the position of motifs relative to anchor features such as transcription start sites (TSS), which can be characterized at high positional resolution using capped short RNA-sequencing (csRNA-seq). We introduce Motif Enrichment Positional Profiling (MEPP), which uses specialized convolutional neural networks to create a profile that characterizes motif enrichment at different motif positions over a dataset of scored sequences.
We also introduce Learning Motifs from Positional Priors (LMPP), which uses machine learning to perform the opposite of MEPP: learning a motif whose positional enrichment resembles a target profile. We use both methods to analyze multiple TSS from csRNA-seq datasets, revealing the positional preferences of transcription factors key to antibacterial and antiviral responses. Overall, this dissertation presents novel methods by which researchers may analyze transcriptional regulation in and beyond immune response.
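
The idea of score-based enrichment via logistic regression can be illustrated with a minimal sketch. It is not the MEIRLOP implementation: it assumes that motif presence in each sequence is modeled as a function of the sequence score plus a single covariate (GC fraction standing in for lower-order sequence bias), and it uses exact string matching in place of a real motif scan. The sequences, scores, the motif string "GGAA", and all function names are hypothetical and chosen only for illustration.

```python
import math

def has_motif(seq, motif):
    """Return 1 if the motif occurs in the sequence (exact match), else 0."""
    return 1 if motif in seq else 0

def gc_fraction(seq):
    """Fraction of G/C bases, used here as a simple sequence-bias covariate."""
    return sum(base in "GC" for base in seq) / len(seq)

def fit_logistic(X, y, lr=0.1, steps=5000, l2=0.01):
    """Fit logistic regression by gradient ascent on the L2-penalized
    log-likelihood. Returns [intercept, coef_1, coef_2, ...]."""
    n = len(X)
    w = [0.0] * (len(X[0]) + 1)
    for _ in range(steps):
        grad = [0.0] * len(w)
        for xi, yi in zip(X, y):
            z = w[0] + sum(wj * xj for wj, xj in zip(w[1:], xi))
            p = 1.0 / (1.0 + math.exp(-z))
            err = yi - p
            grad[0] += err
            for j, xj in enumerate(xi):
                grad[j + 1] += err * xj
        for j in range(len(w)):
            penalty = l2 * w[j] if j > 0 else 0.0  # intercept is not penalized
            w[j] += lr * (grad[j] / n - penalty)
    return w

# Hypothetical scored sequences: (sequence, score), e.g. a log fold change.
scored_seqs = [
    ("ATGGAAGTCCAT", 2.5), ("CCGGAAATTGCA", 1.8), ("TTAGGAACGTAC", 1.2),
    ("ACGTACGTTACG", 0.9), ("GCATGCGGAAGT", 0.4), ("ATGCATGCATGC", -0.3),
    ("AAGGAATTCCGA", -0.8), ("CCGTTAACGTCA", -1.1), ("TTACGCATCGAT", -1.9),
    ("GCATTCGATCGA", -2.4),
]
motif = "GGAA"  # illustrative motif string, not taken from the source

# Response: motif presence. Predictors: score and GC fraction.
y = [has_motif(seq, motif) for seq, _ in scored_seqs]
X = [[score, gc_fraction(seq)] for seq, score in scored_seqs]

weights = fit_logistic(X, y)
score_coef = weights[1]
# A positive score coefficient suggests the motif is enriched toward
# higher-scored sequences after adjusting for the covariate.
print("score coefficient:", score_coef)
```

No thresholding of the sequences into two sets is needed in this formulation: the sign and magnitude of the score coefficient summarize enrichment along the full range of scores, and further covariates can be added as extra columns of `X`.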