One of the primary goals of bioinformatics is the identification of the function
of genes. The most reliable way to do this is through experimentation, but
experiments are slow and expensive. While experimental characterization is
necessary in the beginning and will continue to be necessary for special cases,
it becomes impractical when
one considers the number of different genes encoded in the genomes of all living
organisms. A faster approach is to infer the function of genes
by comparing them to the smaller set of genes with known function. This
comparison may be based on many different kinds of data, including sequence
similarity and gene expression data.
The goal of this dissertation is to provide tools to aid in the identification
of the function of uncharacterized genes. To that end, we first present
a study in which many genes of unknown function were annotated
by clustering gene expression data. We then present a tool for
clustering gene expression data while also identifying short regions
of high sequence similarity (motifs) among members of the clusters.
Finally, we present a tool for identifying the functionally relevant
sub-sections of protein sequences. These sub-sections can then be used to find
other proteins containing similar sub-sections, even when the rest
of each protein is quite different. This tool can thus find
more distantly related proteins that share functionally relevant features.