BioSpike: Efficient search for homologous proteins by indexing patterns
eScholarship
Open Access Publications from the University of California

Abstract

Since high-throughput sequencing tools became available, the number of known protein sequences has been growing at an unprecedented rate. In contrast, information about the structure or function of proteins is extremely sparse. Biologists who study proteins make extensive use of protein search engines to find homologous sequences whose structure or function is known. One well-known measure of sequence similarity is the Smith-Waterman (SW) alignment score. Because calculating the SW score is computationally expensive, various approximations for finding homologous sequences have been suggested; of these, the current de facto standards for protein searching are the BLAST and PSI-BLAST methods of Altschul et al. While BLAST is an efficient approximation to the optimal SW alignment, it is still, from a computer science standpoint, a very inefficient method, as it compares the query sequence to each and every sequence in the database. We present a method for indexing and searching proteins using amino acid patterns. As a source of patterns, we use the BLOCKS library of Henikoff and Henikoff. Position-specific scoring matrices (PSSMs) are used to identify pattern occurrences, and the PSSMs are refined iteratively. Each iteration consists of a "scan", in which we identify all statistically significant pattern occurrences in the sequence set, and a refinement stage, in which we use the identified occurrences to define better PSSMs. The final refined PSSMs are then used to index proteins in the UniProt Knowledgebase (UniProtKB), creating an efficient and accurate tool for searching protein homologues.
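The scan-and-index idea can be illustrated with a minimal Python sketch. It is not the BioSpike implementation: the toy PSSMs, the default mismatch score, the fixed score threshold (standing in for the statistical significance test the abstract mentions), and the function names pssm_scan, build_index, and search are all assumptions made for illustration, and the refinement stage is omitted.

```python
from collections import defaultdict

# Hypothetical score for a residue that a PSSM column does not list.
DEFAULT_SCORE = -1.0


def pssm_scan(sequence, pssm, threshold):
    """Return the start positions where the PSSM window score meets the threshold."""
    width = len(pssm)
    hits = []
    for start in range(len(sequence) - width + 1):
        score = sum(
            pssm[i].get(sequence[start + i], DEFAULT_SCORE) for i in range(width)
        )
        if score >= threshold:
            hits.append(start)
    return hits


def build_index(sequences, pssms, threshold):
    """Map each pattern id to the set of sequence ids that contain the pattern."""
    index = defaultdict(set)
    for seq_id, seq in sequences.items():
        for pat_id, pssm in pssms.items():
            if pssm_scan(seq, pssm, threshold):
                index[pat_id].add(seq_id)
    return index


def search(query, pssms, index, threshold):
    """Rank indexed sequences by the number of patterns they share with the query."""
    shared = defaultdict(int)
    for pat_id, pssm in pssms.items():
        if pssm_scan(query, pssm, threshold):
            for seq_id in index.get(pat_id, ()):
                shared[seq_id] += 1
    # Most shared patterns first; ties broken by sequence id.
    return sorted(shared.items(), key=lambda kv: (-kv[1], kv[0]))


# Toy data (assumed): each PSSM is a list of per-position residue scores.
pssms = {
    "P1": [{"A": 2.0}, {"C": 2.0}, {"D": 2.0}],
    "P2": [{"G": 2.0}, {"H": 2.0}],
}
sequences = {"s1": "MKACDLL", "s2": "GHACDWW", "s3": "MKLLWW"}
THRESHOLD = 4.0  # assumed fixed cutoff, not a significance test

index = build_index(sequences, pssms, THRESHOLD)
results = search("TTGHTACD", pssms, index, THRESHOLD)
print(results)  # [('s2', 2), ('s1', 1)]
```

In this sketch the query is scanned only against the patterns, and candidate sequences are then read from the index, so the query is never compared directly with every database sequence.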

Pre-2018 CSE ID: CS2006-0858
