RNA-Seq Based Transcriptome Assembly: Sparsity, Bias Correction and Multiple Sample Comparison
- Author(s): Li, Wei;
- Advisor(s): Jiang, Tao;
- et al.
RNA-Seq, or deep sequencing of RNAs, is a new technology for transcriptome profiling using second-generation sequencing. RNA-Seq has been widely used to identify and quantify transcriptomes at an unprecedentedly high resolution and low cost. An important computational problem arising from RNA-Seq is transcriptome assembly, in which the structures of transcripts and their expression levels are inferred simultaneously from RNA-Seq data. RNA-Seq transcriptome assembly allows for the detection of structural and quantitative changes of transcripts between samples, paving the way for novel biological discoveries. However, RNA-Seq transcriptome assembly is challenging because: (i) the complicated alternative splicing patterns of some genes result in a huge number of possible transcripts, (ii) different kinds of biases in RNA-Seq reads (including sequencing, positional and mappability biases) decrease the accuracy of assembly and expression level estimation algorithms, and (iii) the existing assembly tools can only reconstruct transcripts from a single sample, leading to a high false positive rate when RNA-Seq experiments from multiple samples are compared.
We propose three algorithms to address these challenges. First, we design a transcriptome assembly tool, IsoLasso, that balances different objectives (prediction accuracy, sparsity, interpretation) and takes advantage of the sparsity of expressed transcripts. Second, we use the quasi-multinomial distribution to model the RNA-Seq biases, and design a new algorithm, CEM, to handle different biases in both transcriptome assembly and transcript expression level estimation. Finally, we propose a multiple-sample transcriptome assembly tool, ISP, to assemble transcripts directly from RNA-Seq data of multiple samples. ISP achieves better performance than assembly tools that consider one sample at a time, and helps to improve the accuracy of downstream differential analysis of transcriptomes between samples.
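To make the idea of exploiting sparsity concrete, the following is a minimal sketch of one common way to select a small set of expressed transcripts: nonnegative L1-penalized least squares, solved by coordinate descent. It is an illustration only and is not taken from IsoLasso; the function name `sparse_isoform_abundances`, the segment-by-transcript compatibility matrix, the penalty `lam`, and the toy data are all assumptions made for this example.

```python
def sparse_isoform_abundances(compat, coverage, lam=0.1, sweeps=2000):
    """Illustrative nonnegative LASSO (hypothetical helper, not IsoLasso itself).

    Minimizes 0.5 * ||coverage - compat @ x||^2 + lam * sum(x) subject to x >= 0.
    compat[s][t] is 1 if candidate transcript t covers segment s, else 0.
    coverage[s] is the observed read coverage of segment s.
    """
    n_seg = len(compat)
    n_tx = len(compat[0])
    x = [0.0] * n_tx
    # Residual coverage - compat @ x; x starts at zero, so it equals coverage.
    residual = [float(c) for c in coverage]
    col_norm = [sum(compat[s][t] ** 2 for s in range(n_seg)) for t in range(n_tx)]

    for _ in range(sweeps):
        for t in range(n_tx):
            if col_norm[t] == 0:
                continue  # transcript covers no segment; leave it at zero
            # Correlation of column t with the residual that excludes t's own term.
            rho = sum(compat[s][t] * residual[s] for s in range(n_seg))
            rho += col_norm[t] * x[t]
            # Soft-threshold and clip at zero (abundances cannot be negative).
            new_x = max(0.0, (rho - lam) / col_norm[t])
            delta = new_x - x[t]
            if delta != 0.0:
                for s in range(n_seg):
                    residual[s] -= compat[s][t] * delta
                x[t] = new_x
    return x


# Toy data (assumed): three segments, three candidate transcripts.
# Columns: t0 covers segments 1-3, t1 covers segments 1 and 3, t2 covers 1 and 2.
compat = [
    [1, 1, 1],
    [1, 0, 1],
    [1, 1, 0],
]
coverage = [15, 10, 15]  # generated by t0 at level 10 plus t1 at level 5
print(sparse_isoform_abundances(compat, coverage, lam=0.1))
```

On this toy input the penalty drives the third candidate to zero while the two transcripts that actually generated the coverage keep abundances close to 10 and 5, which is the sense in which an L1 penalty favors a sparse set of expressed transcripts.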