Decomposing the symphony of Escherichia coli gene expression
- Author(s): Sastry, Anand Varun
- Advisor(s): Palsson, Bernhard O
- et al.
Bacteria respond and adapt to dynamic environments by altering their gene expression through a complex Transcriptional Regulatory Network (TRN). Advances in sequencing technologies have accelerated the generation of large RNA sequencing datasets that can be leveraged to probe the TRN. Here, we analyze large gene expression datasets using Independent Component Analysis (ICA), an unsupervised machine learning algorithm developed to separate source signals (e.g. individual instruments) from a set of mixed signals (e.g. recordings of an orchestra). First, we compile a high-quality RNA-seq compendium containing over 250 expression profiles for the model bacterium Escherichia coli. We apply ICA to decompose this compendium into independently modulated sets of genes, termed i-modulons, that represent the source signals comprising the transcriptome. Of the 92 i-modulons, 61 capture the genome-wide targets of characterized transcriptional regulators with high accuracy. ICA simultaneously estimates the activity of each transcriptional regulator in each growth condition. We show that this representation of the TRN can be used to predict and validate new regulatory interactions, characterize mutations in transcription factors, and provide a basis to compare gene expression across multiple E. coli strains. Next, we show that the underlying structure of the E. coli transcriptome, as determined by ICA, is conserved across multiple independent RNA-seq and microarray datasets. We subsequently combine the datasets into a compendium containing over 800 expression profiles, and discover that the ICA-based structure is still maintained upon data integration. Echoes of this structure were also found in two proteomics datasets, accelerating biological discovery through multi-omics analysis. Finally, we investigate how gene expression links environmental conditions to antibiotic efficacy. We show that i-modulons simplify the analysis of complex transcriptional changes to enable rapid characterization of cellular states beyond the transcriptome. As a whole, this body of work introduces ICA as a compelling tool to integrate and understand large gene expression datasets.
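The orchestra analogy used above for ICA can be illustrated with a toy example. The sketch below is not the pipeline used in this work; it is a minimal two-signal illustration that assumes a simple kurtosis-based contrast (whiten the mixtures, then search for the rotation that makes the outputs most non-Gaussian). The signals, mixing weights, and function names are all hypothetical, chosen only to show how independent sources can be recovered from mixed recordings.

```python
import math
import random


def standardize(x):
    """Center a signal and scale it to unit variance."""
    n = len(x)
    mean = sum(x) / n
    centered = [v - mean for v in x]
    std = math.sqrt(sum(v * v for v in centered) / n)
    return [v / std for v in centered]


def rotate(x1, x2, angle):
    """Apply a 2-D rotation to the paired samples (x1, x2)."""
    c, s = math.cos(angle), math.sin(angle)
    y1 = [c * u + s * v for u, v in zip(x1, x2)]
    y2 = [-s * u + c * v for u, v in zip(x1, x2)]
    return y1, y2


def whiten(x1, x2):
    """Decorrelate two signals and give each unit variance."""
    n = len(x1)
    m1, m2 = sum(x1) / n, sum(x2) / n
    x1 = [v - m1 for v in x1]
    x2 = [v - m2 for v in x2]
    a = sum(u * u for u in x1) / n
    b = sum(u * v for u, v in zip(x1, x2)) / n
    d = sum(v * v for v in x2) / n
    # Principal-axis angle of the 2x2 covariance matrix [[a, b], [b, d]].
    theta = 0.5 * math.atan2(2 * b, a - d)
    p1, p2 = rotate(x1, x2, theta)
    return standardize(p1), standardize(p2)


def excess_kurtosis(y):
    """Excess kurtosis of a zero-mean, unit-variance signal."""
    return sum(v ** 4 for v in y) / len(y) - 3.0


def separate(mixed_1, mixed_2, steps=180):
    """Toy two-signal ICA: whiten, then grid-search the rotation that
    maximizes the summed squared excess kurtosis of the outputs."""
    z1, z2 = whiten(mixed_1, mixed_2)
    best = None
    for k in range(steps):
        angle = (math.pi / 2) * k / steps
        y1, y2 = rotate(z1, z2, angle)
        score = excess_kurtosis(y1) ** 2 + excess_kurtosis(y2) ** 2
        if best is None or score > best[0]:
            best = (score, y1, y2)
    return best[1], best[2]


def correlation(x, y):
    """Pearson correlation between two signals."""
    xs, ys = standardize(x), standardize(y)
    return sum(u * v for u, v in zip(xs, ys)) / len(xs)


# Two hypothetical "instruments": a noise-like source and a sine wave.
random.seed(0)
n = 2000
source_a = [random.uniform(-1, 1) for _ in range(n)]
source_b = [math.sin(0.05 * t) for t in range(n)]

# Two hypothetical "recordings", each a different blend of the sources.
mixed_1 = [0.7 * a + 0.3 * b for a, b in zip(source_a, source_b)]
mixed_2 = [0.4 * a + 0.6 * b for a, b in zip(source_a, source_b)]

rec_1, rec_2 = separate(mixed_1, mixed_2)

# ICA recovers sources only up to order and sign, so test both pairings.
match = max(
    min(abs(correlation(rec_1, source_a)), abs(correlation(rec_2, source_b))),
    min(abs(correlation(rec_1, source_b)), abs(correlation(rec_2, source_a))),
)
print("worst-case |correlation| with a true source:", round(match, 3))
```

In the gene expression setting described above, the mixed signals would correspond to expression profiles and the recovered components to i-modulons with their condition-specific activities. A real analysis would need an ICA implementation that handles many components rather than this two-signal search.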