In the last five years, high-throughput sequencing has revolutionized biological research. The ability to quickly generate millions of short sequence reads enables studies that would have been inconceivable even 10 years ago. This work focuses on RNA-Seq, the application of high-throughput sequencing to an organism's transcriptome. We describe a method of library preparation that improves sequence coverage, a new algorithm for detecting splice junctions in RNA-Seq datasets, and finally, the application of these techniques to the study of splicing in Plasmodium falciparum.
The long march is a technique for Solexa library preparation that increases contig length and target sequence coverage. The long march incorporates a Type IIS restriction enzyme recognition site into the sequencing primer adapter. Each round of marching cuts off the initial part of the read and ligates a new adapter downstream, creating overlapping reads. Validation on P. falciparum genomic samples and human hepatitis B virus-positive samples showed increases of 39% and 42%, respectively, in the number of bases covered.
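The effect of marching on coverage can be illustrated with a toy calculation. The following is a minimal sketch, not the protocol itself: the function names and the fragment length, read length, and march step are hypothetical values chosen for illustration, and the sketch assumes every round succeeds on a single fragment.

```python
def march_read_starts(fragment_len, step, rounds):
    """Start offsets of the reads obtained after 0..rounds marching rounds.

    Assumes (for illustration only) that each round removes exactly `step`
    bases from the start of the fragment before the new adapter is ligated.
    """
    starts = []
    for r in range(rounds + 1):
        start = r * step
        if start >= fragment_len:
            break  # nothing left of the fragment to sequence
        starts.append(start)
    return starts


def covered_bases(fragment_len, read_len, starts):
    """Number of distinct fragment positions covered by reads at `starts`."""
    covered = set()
    for s in starts:
        covered.update(range(s, min(s + read_len, fragment_len)))
    return len(covered)


# Hypothetical numbers: a 200 bp fragment, 36 bp reads, 16 bp march step.
starts = march_read_starts(fragment_len=200, step=16, rounds=3)
print(starts)                           # [0, 16, 32, 48]
print(covered_bases(200, 36, [0]))      # 36 bases without marching
print(covered_bases(200, 36, starts))   # 84 bases after three rounds
```

Because the step is shorter than the read, consecutive reads overlap, which is what lets them be assembled into longer contigs. The toy figures are per-fragment and are not comparable to the dataset-level increases reported above.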
Next, we developed an algorithm to detect spliced reads crossing exon-exon junctions in RNA-Seq datasets. Our algorithm uses an unbiased approach, relying only on the read dataset and a reference genome, and detects both canonical and noncanonical splice junctions. It divides each read in half for initial seeding against the reference genome, then uses a hidden Markov model (HMM), trained on the input data, to determine the optimal splice position. Our algorithm provides a score for each splice junction, which allows researchers to tune the false positive rate to the requirements of their experiment. When tested on publicly available datasets for Arabidopsis thaliana, Plasmodium falciparum, and Homo sapiens, this approach identifies more splice junctions than currently available algorithms, without a reduction in specificity.
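The half-read seeding idea can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the algorithm described above: exact string matching stands in for a real short-read aligner, the HMM that selects and scores the optimal splice position is omitted (all consistent candidates are returned instead), and the function names and the `min_piece`, `min_intron`, and `max_intron` parameters are hypothetical.

```python
def find_all(text, pattern, start=0):
    """All offsets of exact matches of pattern in text at or after start."""
    hits = []
    pos = text.find(pattern, start)
    while pos != -1:
        hits.append(pos)
        pos = text.find(pattern, pos + 1)
    return hits


def left_anchored(genome, read, min_piece, min_intron, max_intron):
    """Candidate introns for reads whose first half aligns unspliced."""
    n = len(read)
    half = n // 2
    found = set()
    for left_pos in find_all(genome, read[:half]):
        # Try each split offset k; keep at least min_piece bases downstream.
        for k in range(half, n - min_piece + 1):
            if genome[left_pos:left_pos + k] != read[:k]:
                break  # the upstream piece no longer matches the genome
            intron_start = left_pos + k
            for tail_pos in find_all(genome, read[k:], intron_start + min_intron):
                if tail_pos - intron_start > max_intron:
                    break
                found.add((intron_start, tail_pos))
    return found


def candidate_junctions(genome, read, min_piece=4, min_intron=5, max_intron=1000):
    """Sorted candidate introns as 0-based, half-open (start, end) intervals."""
    found = left_anchored(genome, read, min_piece, min_intron, max_intron)
    # A junction in the first half is handled by anchoring the second half:
    # reverse (not reverse-complement) both strings and map coordinates back.
    size = len(genome)
    reversed_hits = left_anchored(genome[::-1], read[::-1],
                                  min_piece, min_intron, max_intron)
    for start, end in reversed_hits:
        found.add((size - end, size - start))
    return sorted(found)


# Toy genome: two exons separated by a 19 bp intron at positions 15..34.
exon1, intron, exon2 = "ACGTTGCAGGCTAAC", "GTAAGTATATATTTTTCAG", "CGATCCGTTAGCAAT"
genome = exon1 + intron + exon2
read = exon1[-10:] + exon2[:6]           # junction falls in the second half
print(candidate_junctions(genome, read))  # [(15, 34)]
```

When the bases flanking a junction are repeated, several split positions are equally consistent with the read, which is the ambiguity a trained model and a per-junction score are meant to resolve.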
Finally, our library preparation technique and splice detection algorithm were used to study splicing in P. falciparum. Both our data and publicly available datasets were used to identify splicing events in the blood stages of the parasite. We confirmed 6,678 previously known introns and identified 977 novel introns with canonical splice edges. In addition, we detected 310 alternative splicing events as well as splicing events antisense to known transcripts.