Coding sequences (CDSs) have traditionally been predicted on the basis of sequence and length: on any given mRNA, the longest ATG-initiated CDS is annotated as the translated protein product. Ribosome profiling, a technique for sequencing the regions of mRNA being decoded by the ribosome, provides a high-resolution view of translation and thus enables data-dependent annotation of CDSs. When performed in the presence or absence of drugs that target the ribosome in stereotyped ways, ribosome profiling reveals the core features of translation: initiation, phased elongation, and termination.
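For illustration, the traditional longest-ATG rule described above can be sketched in a few lines of Python. This is a minimal sketch, not any annotation pipeline's actual code: the function name `longest_atg_orf`, the forward-strand-only scan, and the choice to ignore ORFs that lack an in-frame stop codon are all assumptions made for this example.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def longest_atg_orf(mrna):
    """Return (start, end, sequence) of the longest ATG-initiated ORF.

    Coordinates are 0-based and half-open, and the stop codon is included.
    Only ORFs closed by an in-frame stop codon are considered (an assumption
    of this sketch). Returns None if no such ORF exists.
    """
    seq = mrna.upper().replace("U", "T")
    best = None
    for frame in range(3):
        start = None  # position of the first ATG since the last stop codon
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                end = i + 3
                if best is None or end - start > best[1] - best[0]:
                    best = (start, end)
                start = None
    if best is None:
        return None
    return best[0], best[1], seq[best[0]:best[1]]
```

By construction, a rule of this kind cannot report non-ATG initiation, short CDSs that lose out to a longer one, or CDSs that overlap the longest one, which is the limitation that data-dependent annotation addresses.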
We developed and applied a computational analysis pipeline that leverages all these features to identify CDSs systematically. Notably, our analysis is not biased by length, initiator codon, or overlapping CDS structure. We performed our experiments in bone marrow-derived dendritic cells undergoing lipopolysaccharide (LPS) stimulation. Cell type-specific transcript assemblies guided our analysis, and mass spectrometry corroborated our predictions. We identified translation of both annotated and unannotated CDSs.
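As one example of how a feature such as phased elongation might be measured, the sketch below computes the fraction of footprints falling in each reading frame of a candidate CDS. It is a hypothetical illustration, not the pipeline described here: the function name `frame_fractions`, the use of precomputed P-site positions, and the 0-based half-open transcript coordinates are all assumptions.

```python
from collections import Counter

def frame_fractions(psite_positions, cds_start, cds_end):
    """Fraction of footprint P-sites in frames 0, 1, and 2 of a candidate CDS.

    psite_positions: transcript coordinates (0-based) of inferred P-sites.
    cds_start, cds_end: half-open interval of the candidate CDS.
    A strongly phased, actively translated CDS is expected to concentrate
    its footprints in one frame.
    """
    counts = Counter(
        (p - cds_start) % 3
        for p in psite_positions
        if cds_start <= p < cds_end
    )
    total = sum(counts.values())
    if total == 0:
        return (0.0, 0.0, 0.0)
    return tuple(counts[f] / total for f in range(3))
```

Under these assumptions, footprints at positions 10, 13, 16, and 11 within a CDS spanning 10 to 30 give frame fractions of 0.75, 0.25, and 0.0.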
Unannotated translation comprises previously unknown variants of annotated proteins and completely novel CDSs. We identified 829 N-terminal truncations and 609 N-terminal extensions, many of which affect localization signals or domain structure. We also identified 5,215 CDSs translated from regions upstream of annotated CDSs; these upstream CDSs may regulate expression of their associated downstream CDSs. A further 472 new CDSs arise from new transcripts or from transcripts previously defined as non-coding. Finally, we found 989 translation events that occur within annotated CDSs but out of frame relative to them.
We used ribosome density to quantify expression of all identified CDSs at nine time points after LPS stimulation (0, 0.5, 1, 2, 4, 6, 8, 9, and 12 h). Many newly identified CDSs display significantly enhanced or repressed expression over the time course relative to mock-stimulated cells.
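One common way to express ribosome density is as footprints per kilobase of CDS per million mapped footprints, compared between conditions as a log2 fold change. The sketch below assumes that normalization; the function names, the pseudocount, and the normalization itself are illustrative assumptions and are not stated in this work.

```python
import math

def ribosome_density(footprints, cds_length_nt, total_mapped):
    """Footprints per kilobase of CDS per million mapped footprints.

    This RPKM-style normalization is an assumption of this sketch.
    """
    return footprints * 1e9 / (cds_length_nt * total_mapped)

def log2_fold_change(stimulated, mock, pseudocount=1.0):
    """log2 ratio of stimulated to mock density.

    The pseudocount (an assumed value) avoids division by zero for
    CDSs with no footprints in one condition.
    """
    return math.log2((stimulated + pseudocount) / (mock + pseudocount))
```

With these definitions, 300 footprints on a 600-nt CDS in a library of one million mapped footprints correspond to a density of 500, and densities of 7 (stimulated) and 1 (mock) give a log2 fold change of 2.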