The Promoter Sequence Basis for Transcription Regulatory Activity
The transcriptional regulatory network (TRN) of E. coli MG1655 contains thousands of regulatory interactions among transcription factors, sigma factors, and the promoter sequences of genes. Regulon membership, that is, whether a gene is regulated by a given transcription factor, has a basis in the gene’s promoter. Two primary families of methods are used to determine the TRN: bottom-up identification of binding sites with experiments such as ChIP-seq, and top-down inference from expression data with methods such as independent component analysis (ICA). In this work, the promoter sequence of each gene was used to predict regulon membership with machine learning. Certain promoter features, such as TF binding site motif scores, DNA shape features, and sigma factor-related features, were found to be essential for predicting whether a gene is regulated by particular transcription factors. The ICA and ChIP TRNs were compared to investigate the factors underlying their differences, and two case studies were carried out. An FNR case study showed that ICA regulons partition a large ChIP regulon into several regulation types. An ArcA case study demonstrated that the ICA TRN captures diversity in binding site architecture. Overall, the comparison indicates that the ICA TRN extracts the genes with strong regulatory activity from the ChIP TRN. We then extended this analysis to understand differences in regulons across multiple strains of E. coli. A pan-regulon of Fur was reconstructed, with unique, accessory, and core regulons annotated. We found that genes in the core regulon have stronger regulatory activity than genes in the unique regulon. The motif score and helix twist of the DNA sequence were also both found to be significant indicators of Fur ChIP-exo peak height, as represented by the signal-to-noise (S/N) ratio.
This study applies machine learning to a biological problem in order to probe the biophysical factors underlying omics data, and it suggests directions for genome design in synthetic biology that aims to control phenotype by tuning the TRN.
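As an illustration of one promoter feature named above, the TF binding site motif score, the following is a minimal sketch of how such a score could be computed. It is not the pipeline used in this work: it assumes a log-odds position weight matrix built from aligned binding sites, a uniform background, and a single-strand scan. The function names, the pseudocount, and the toy sequences are all hypothetical.

```python
import math

BASES = "ACGT"


def log_odds_pwm(sites, pseudocount=1.0, background=0.25):
    """Build a log-odds position weight matrix from aligned, equal-length sites.

    Assumption: uniform background frequency and a flat pseudocount.
    """
    length = len(sites[0])
    pwm = []
    for pos in range(length):
        counts = {b: pseudocount for b in BASES}
        for site in sites:
            counts[site[pos]] += 1
        total = sum(counts.values())
        pwm.append({b: math.log2(counts[b] / total / background) for b in BASES})
    return pwm


def best_motif_score(promoter, pwm):
    """Return (score, offset) of the highest-scoring window on the given strand."""
    width = len(pwm)
    best = (float("-inf"), -1)
    for start in range(len(promoter) - width + 1):
        window = promoter[start:start + width]
        score = sum(pwm[i][base] for i, base in enumerate(window))
        if score > best[0]:
            best = (score, start)
    return best


# Hypothetical aligned binding sites and promoter (toy data, not real Fur/FNR sites).
toy_sites = ["TTGATA", "TTGACA", "TTGATT", "TAGATA"]
toy_promoter = "GCGCTTGATAGCGC"

toy_pwm = log_odds_pwm(toy_sites)
score, offset = best_motif_score(toy_promoter, toy_pwm)
print(f"best window starts at {offset} with log-odds score {score:.2f}")
```

In a setup like this, the best score per promoter would be one column of a feature table, alongside DNA shape and sigma factor-related features, that a classifier uses to predict regulon membership. A fuller version would also scan the reverse complement.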