Full characterization of transcriptomes using long read sequencing
- Author(s): Wyman, Dana Elizabeth
- Advisor(s): Mortazavi, Ali
- et al.
Almost all multi-exonic human genes are believed to undergo alternative splicing, giving rise to isoforms with potentially distinct functions, tissue specificities, and developmental roles. Differential isoform usage has been implicated both in normal developmental processes and in disease states. Much of the previous work to identify and quantify individual gene isoforms has been performed using short-read RNA sequencing on the Illumina platform. While this technology is considered the state of the art for quantifying gene expression, short reads are unable to accurately resolve full-length mammalian isoforms, which can be multiple kilobases long. Although computational methods have been developed to reconstruct isoforms from short reads, these are not able to overcome the fundamental limitations of the technology.
Long-read sequencing platforms such as Pacific Biosciences (PacBio) and Oxford Nanopore (ONT) bypass the transcript reconstruction challenges of short reads and offer the additional advantage of sequencing single molecules individually. PacBio sequencing in particular has been used extensively for de novo isoform reconstruction but was previously not deemed useful for quantitative measurements of gene or transcript expression due both to the cost of the assay and to its relatively low throughput. However, technical advances have increased the yield as well as the accuracy of longer reads, presenting an opportunity to use these technologies directly for isoform-level quantification. Here, I present novel methods for long-read error correction, isoform discovery, and quantification in RNA samples from both pooled and single cells. First, I introduce TranscriptClean, a program that leverages a reference genome to correct common sequencing errors in long reads. Next, I describe TALON, a technology-agnostic approach to discovering and quantifying isoforms in multiple long-read datasets. Finally, I apply TALON to the analysis of deeply sequenced single cells from the developing mouse limb bud, demonstrating that long reads can provide key biological insights in the context of development. Together, these projects help pave the way for long-read transcriptome analyses at both the bulk and single-cell levels, which grant us new insights into isoform expression across diverse human and mouse tissues.