Computational methods for genome-wide non-coding RNA discovery and analysis
- Author(s): Zhang, Shaojie
- et al.
The discovery of novel non-coding RNAs has been among the most exciting recent developments in biology, yet many more remain undiscovered. It has been hypothesized that there is in fact an abundance of functional non-coding RNAs (ncRNAs) with various catalytic and regulatory functions. Computational methods tailored specifically for ncRNA discovery are being actively developed. Because the inherent signal for ncRNAs is weaker than that for protein-coding genes, comparative methods offer the most promising approach. In this dissertation, we address several open problems in computational methods for genome-wide ncRNA discovery and analysis. (1) We first consider the following problem: given an RNA sequence with a known secondary structure, efficiently detect all structural homologs in a genomic database by computing the sequence and structure similarity to the query. Our approach is based on structural filters that eliminate a large portion of the database while retaining the true homologs, and it allows us to search a typical bacterial genome in minutes on a standard PC. This result is two orders of magnitude better than currently available software for the problem. (2) We formalize the concept of a filter and provide figures of merit that allow comparison between filters. We design efficient sequence-based filters that dominate the current state-of-the-art HMM filters. We provide a new formulation of the covariance model that allows speeding up RNA alignment. We demonstrate the power of our approach on both synthetic data and real bacterial genomes. We then apply our algorithm to the detection of novel riboswitch elements in whole bacterial and archaeal genomes and in environmental sequence data. Our results point to a number of novel riboswitch candidates and include genomes that were not previously known to contain riboswitches. (3) We propose a novel framework for predicting the common secondary structure of unaligned RNA sequences. By matching putative stacks in RNA sequences, we use primary sequence information and thermodynamic stability simultaneously for prediction. We show that our method can predict the correct common RNA secondary structures even when given only a limited number of unaligned RNA sequences, and that it outperforms current algorithms in sensitivity and accuracy. Together, these contributions advance genome-wide ncRNA discovery for exploring the modern RNA world.
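To make the filtering idea in items (1) and (2) concrete, the following is a minimal toy sketch and not the dissertation's actual filters. It assumes a deliberately simple criterion: a database window passes if it contains a k-mer followed, after a loop of bounded length, by that k-mer's reverse complement (a putative stem). The function names, parameters (`k`, `min_loop`, `max_loop`), and the two figures of merit shown (sensitivity and fraction of the database retained) are illustrative assumptions.

```python
# Toy stem-based filter: hypothetical illustration, not the author's method.
COMPLEMENT = str.maketrans("ACGU", "UGCA")


def reverse_complement(seq):
    """Return the Watson-Crick reverse complement of an RNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def passes_stem_filter(window, k=4, min_loop=3, max_loop=8):
    """True if the window holds a k-mer and, downstream, its reverse
    complement separated by a loop of min_loop..max_loop bases."""
    n = len(window)
    for i in range(n - k + 1):
        partner = reverse_complement(window[i:i + k])
        first = i + k + min_loop                 # earliest partner start
        last = min(i + k + max_loop, n - k)      # latest partner start
        for j in range(first, last + 1):
            if window[j:j + k] == partner:
                return True
    return False


def figures_of_merit(windows, is_true_homolog, filt):
    """Return (sensitivity, fraction_kept) for a filter.

    sensitivity   = true homologs retained / all true homologs
    fraction_kept = windows retained / all windows (lower is cheaper)
    """
    kept = [filt(w) for w in windows]
    true_total = sum(is_true_homolog)
    true_kept = sum(1 for kp, t in zip(kept, is_true_homolog) if kp and t)
    sensitivity = true_kept / true_total if true_total else 1.0
    fraction_kept = sum(kept) / len(windows) if windows else 0.0
    return sensitivity, fraction_kept


if __name__ == "__main__":
    windows = ["GGGCAAAAGCCC", "AAAAAAAAAAAA", "ACACACACACAC", "CCCCUUUUGGGG"]
    labels = [True, False, False, False]  # made-up labels for the demo
    print(figures_of_merit(windows, labels, passes_stem_filter))
```

In this sketch, a good filter keeps sensitivity near 1 while driving the retained fraction down, so that the expensive sequence and structure alignment only runs on the small set of surviving windows.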