Since high-throughput sequencing tools became available, the
number of known protein sequences has been growing at an unprecedented rate. On
the other hand, information about the structure or function of proteins is
extremely sparse. Biologists who study proteins make extensive use of protein
search engines to find homologous sequences whose structure or function is
known. One well-known measure of sequence similarity is the Smith-Waterman
(SW) alignment score. As calculating the SW score is computationally expensive,
various approximations for finding homologous sequences have been suggested.
Of these, the BLAST and PSI-BLAST methods of Altschul et al. are the current
de facto standard for protein searching. While BLAST is an efficient
approximation of the optimal SW alignment, it is still, from a
computer science standpoint, a very inefficient method, because it compares the
query sequence to every sequence in the database. We present a method for
indexing and searching proteins using amino acid patterns. As a source of
patterns, we use the BLOCKS library of Henikoff and Henikoff. Position-specific
scoring matrices (PSSMs) are used to identify pattern occurrences, and the
PSSMs are refined iteratively. Each iteration
consists of a "scan", in which we identify all statistically significant
pattern occurrences in the sequence set, and a refinement stage, in which we
use the identified occurrences to define better PSSMs. The final refined PSSMs
are then used to index proteins in the UniProt Knowledgebase (UniProtKB),
creating an efficient and accurate tool for searching for protein homologues.
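
The idea of scanning with a PSSM and then indexing proteins by the patterns they contain can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (scan, build_index, search), the toy matrix, the default score for unlisted residues, and the fixed score threshold (standing in for a real statistical significance test) are all assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical score for residues a PSSM column does not list.
DEFAULT_SCORE = -4.0

def scan(pssm, seq, threshold):
    """Return start positions where the summed column scores reach threshold."""
    width = len(pssm)
    hits = []
    for start in range(len(seq) - width + 1):
        score = sum(col.get(seq[start + i], DEFAULT_SCORE)
                    for i, col in enumerate(pssm))
        if score >= threshold:
            hits.append(start)
    return hits

def build_index(pssms, proteins, threshold):
    """Map each pattern name to the set of protein IDs in which it occurs."""
    index = defaultdict(set)
    for pid, seq in proteins.items():
        for name, pssm in pssms.items():
            if scan(pssm, seq, threshold):
                index[name].add(pid)
    return index

def search(index, pssms, query, threshold):
    """Return IDs of indexed proteins sharing at least one pattern with query."""
    candidates = set()
    for name, pssm in pssms.items():
        if scan(pssm, query, threshold):
            candidates |= index.get(name, set())
    return candidates

# Toy data: one width-3 pattern and three short sequences (all made up).
pssms = {"motif1": [{"A": 2.0}, {"C": 2.0}, {"D": 2.0}]}
proteins = {"P1": "GGACDGG", "P2": "GGGGGG", "P3": "ACD"}
index = build_index(pssms, proteins, threshold=6.0)
print(sorted(search(index, pssms, "TTACD", threshold=6.0)))  # ['P1', 'P3']
```

In this sketch the query is scanned only against the patterns, and candidates come from an index lookup, so the query is never compared directly with each database sequence.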
Pre-2018 CSE ID: CS2006-0858