Deciphering genomic grammar of regulatory sequences
- Li, Rick Z
- Advisor(s): Glass, Christopher
Abstract
Multicellular organisms contain hundreds of cell types that share the same genome, yet each cell type expresses a unique set of genes. Cell-type-specific gene expression is primarily governed by non-coding regions called enhancers. Enhancers are activated through the binding of transcription factors in response to extracellular and intracellular signals, resulting in the modulation of cell-type- and cell-state-specific gene expression. Growing evidence suggests that enhancer selection and activation require collaborative binding of multiple transcription factors. Enhancer sequences, especially the arrangements of multiple transcription factor binding motifs within each enhancer, provide valuable insights into the underlying regulatory mechanisms of each cell type. Indeed, the spatial arrangements of transcription factor binding motifs within enhancer sequences resemble human language, where word/token interactions are highly dynamic and complex. To characterize cell-type-specific enhancer sequences, I began with a systematic analysis of the different types of spacing relationships between transcription factor pairs and discovered that most transcription factor pairs tolerate relaxed spacing between their binding sites. Next, I used co-occurrence, a linguistics-inspired concept, to develop TIMON, a computational tool that identifies co-occurring transcription factor motifs. Integrating TIMON results with multi-omics data, I identified key transcriptional regulators of microglia development. Lastly, I leveraged advanced natural language processing techniques to develop TIANA, an interpretation-oriented deep learning framework that enables the identification of transcription factor interactions from regulatory sequences. Applying TIANA to transcription factor ChIP-seq and enhancer datasets demonstrated its ability to identify transcription factor motif interactions that are consistent with experimental findings.
Taken together, this thesis provides novel insights into regulatory mechanisms through the lens of genomic grammar.