A Study on Correlations between Genes' Functions and Evolutions
- Author(s): Bejraburnin, Natth;
- Advisor(s): Pachter, Lior;
- et al.
Genes are functional units in organisms' genomes that are believed to play an important role in how organisms develop heritable characteristics such as eye and hair color. One of the fundamental problems in computational biology is to predict gene functions from DNA sequence data. Researchers have developed several methods, both experimental and computational, to tackle this problem. Some of those methods rely on the assumption that genes with similar functions are likely to evolve in a correlated fashion, and vice versa. I aim to investigate this assumption within a statistical framework.
I define a measure that quantifies how dissimilar the evolutions of any pair of genes are. To define the measure properly, I use an evolutionary model to represent the evolution of each gene. The model essentially serves as an encoding of the distribution of all possible character sequences that could be observed at the leaf nodes of the model. The measure between any two evolutionary models is then defined as the Kullback-Leibler divergence between the two distributions encoded by the models. Since computing the measure exactly is not computationally tractable, I instead propose an efficient algorithm for estimating it. Genes' functions are determined from the Gene Ontology Consortium database. Finally, I apply statistical tests for clustering to verify whether genes with correlated evolutions tend to have similar functions. I find that the hypothesis does not always hold: for some groups of genes, functions are not correlated with evolutions, while for other groups, functions and evolutions are well correlated.
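To make the measure concrete, recall that the Kullback-Leibler divergence from a distribution P to a distribution Q over the same outcomes x is the sum over x of P(x) log(P(x) / Q(x)). The sketch below is only an illustration under stated assumptions and is not the estimation algorithm proposed in the thesis, which the abstract does not describe. It assumes each model can be reduced to an explicit table of leaf-pattern probabilities (the toy tables `p` and `q` and the function names are hypothetical), and it uses plain Monte Carlo sampling as a stand-in estimator.

```python
import math
import random


def exact_kl(p, q):
    """Exact KL(p || q) for two distributions given as dicts
    mapping a leaf pattern to its probability.
    Assumes q[x] > 0 wherever p[x] > 0."""
    return sum(px * math.log(px / q[x]) for x, px in p.items() if px > 0)


def monte_carlo_kl(p, q, n_samples, seed=0):
    """Estimate KL(p || q) by drawing patterns from p and averaging
    the log-likelihood ratio log(p(x) / q(x)) over the draws."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    patterns = list(p)
    weights = [p[x] for x in patterns]
    draws = rng.choices(patterns, weights=weights, k=n_samples)
    return sum(math.log(p[x] / q[x]) for x in draws) / n_samples


# Hypothetical toy distributions over the characters seen at two leaves.
# In p the two leaves tend to agree; in q all four patterns are equally likely.
p = {"AA": 0.4, "AC": 0.1, "CA": 0.1, "CC": 0.4}
q = {"AA": 0.25, "AC": 0.25, "CA": 0.25, "CC": 0.25}

if __name__ == "__main__":
    print("exact KL(p || q):   ", exact_kl(p, q))
    print("sampled KL(p || q): ", monte_carlo_kl(p, q, n_samples=20000))
```

For realistic models the table of leaf patterns grows exponentially with the number of leaves and the sequence length, so the exact sum cannot be enumerated. A sampling estimator only needs to draw patterns from one model and evaluate their probabilities under both, which is why estimation is the practical route.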
In addition, the methods presented in this research lay out a framework for any study that involves quantitative analysis of genes' or proteins' evolutions. This thesis presents one application of this framework, which focuses solely on genes' functions, but the methods can be applied to other types of attributes, such as proteins' structures.