Analysis and applications of conserved sequence patterns in proteins
- Author(s): Ie, Tze Way Eugene
- et al.
Modern sequencing initiatives have produced a large amount of protein sequence data. The exponential growth of these databases is not matched by the rate at which the sequences are annotated. Reliable structural and functional annotations for protein sequences are limited, and computational methods have been steadily developed to bridge this knowledge gap. This dissertation develops a number of computational techniques for analyzing protein and genomic sequences. They rely heavily on statistics and modern machine learning algorithms. First, we introduce an application of support vector machines and structured output codes to the problem of classifying protein sequences into one of many protein structural groups. Although our method works with any type of base binary classifier, we found that it works best when the base classifiers leverage unlabeled protein sequences. The need to quickly identify similar protein sequences motivates our next contribution, a novel index-based framework for protein sequence search. The search index is based on robust statistical models of conserved sequence patterns. Users of our system can plug in any existing protein motif library to increase the coverage of the index. Furthermore, the framework can systematically refine any bootstrapped profile patterns using the large amounts of unannotated sequence data available today. We further supplement the system with a novel random projections-based algorithm for finding motifs that are prevalent across many protein sequences. Finally, we outline a new computational problem of finding protein-coding regions in microbial genome fragments. This is of particular interest to recent explorations in metagenomics, where the microbial communities under scrutiny are increasingly complex.
Highly complex metagenomes usually exhibit lower sequence redundancy for the same amount of sequencing, rendering fragment assembly infeasible as a pre-processing step. We develop a novel evidence integration approach for finding genes on metagenomic fragments that requires no fragment assembly.