Regression Modeling and Bias Correction of Ribosome Profiling Data
- Author(s): Tunney, Robert
- Advisor(s): Lareau, Liana F; Pachter, Lior S; et al.
Translational regulation is an important control point for gene expression, modulating the quantity and isoforms of proteins produced in a cell. The ribosome profiling method measures translation dynamics and output directly by sampling the distribution of ribosomes across all mRNA transcripts in a sample. This method has demonstrated that ribosomes do not move at a uniform rate across transcripts, and that synonymous codon choice can have large effects on ribosome speed, RNA stability, and protein expression. We present a regression model that uses a feedforward neural network to predict the ribosome density at each codon in a transcriptome as a function of the local sequence neighborhood around that codon. This approach identified a collection of sequence features that contain substantial predictive information about translation elongation rates. We apply this model to characterize the translation rates of naturally occurring genes, and also to design translation-optimized coding sequences for a given protein. We present a novel and efficient algorithm that finds the fastest and slowest predicted coding sequences for a given protein. We validated our regression model and optimization procedure by designing synonymous variants of eCitrine, a yellow fluorescent protein, across a range of predicted translation rates. Our results showed that the levels of expressed protein closely tracked the predicted overall translation rates of the synonymous coding sequences. This demonstrated that our model captures information determining translation dynamics in vivo, that we can harness this information to design coding sequences, and that control of translation elongation alone is sufficient to produce large, quantitative differences in protein output. Analysis of our regression model also showed that the terminal regions of ribosome footprints are important predictors of footprint density at a given codon.
This suggests that ligation events in the experimental protocol differentially recover footprints based on their terminal sequences. We characterized this recovery bias both computationally and experimentally, and demonstrated that it can have a large impact on the count of footprints recovered at a given codon. To correct for this bias, we developed a generative model of ribosome footprint experiments that incorporates both the biological distribution of footprints across transcripts and the experimental steps that introduce recovery bias. We developed a software tool to estimate the parameters of this model, and present a statistical method to correct recovery bias and estimate the biological distribution of ribosomes across transcripts. This method enables improved estimates of translation for the many applications in which ribosome profiling data are used.
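As an illustration of what "a function of the local sequence neighborhood around that codon" could look like as model input, the sketch below one-hot encodes a window of codons around a site of interest. This is a minimal sketch and not the thesis implementation: the function names, the window bounds (`upstream`, `downstream`), and the choice of a codon-level one-hot encoding are all assumptions made for illustration, since the abstract does not specify them.

```python
from itertools import product

# All 64 codons in a fixed order; a codon's index is its one-hot position.
CODONS = ["".join(p) for p in product("ACGT", repeat=3)]
CODON_INDEX = {codon: i for i, codon in enumerate(CODONS)}


def split_codons(cds):
    """Split an in-frame coding sequence into a list of codons."""
    if len(cds) % 3 != 0:
        raise ValueError("coding sequence length must be a multiple of 3")
    return [cds[i:i + 3] for i in range(0, len(cds), 3)]


def encode_neighborhood(cds, site, upstream=5, downstream=4):
    """One-hot encode codons from site - upstream to site + downstream.

    `site` is a 0-based codon index. The window bounds are hypothetical
    defaults, not values taken from the abstract. Positions outside the
    coding sequence, and codons with ambiguous bases, stay all zeros.
    """
    codons = split_codons(cds.upper().replace("U", "T"))
    features = []
    for pos in range(site - upstream, site + downstream + 1):
        one_hot = [0] * len(CODONS)
        if 0 <= pos < len(codons):
            idx = CODON_INDEX.get(codons[pos])
            if idx is not None:
                one_hot[idx] = 1
        features.extend(one_hot)
    return features
```

A vector built this way has 64 entries per window position, so a window of ten codons yields 640 features. A regression model such as a feedforward network would take this vector as input and output a predicted footprint density for the central codon.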