The accurate identification of orthologous genes across different species is a critical and challenging problem in comparative genomics and has a wide spectrum of biological applications including gene function inference, evolutionary studies and systems biology. During the past several years, many methods have been proposed for ortholog assignment based on sequence similarity, phylogenetic approaches, synteny information, and genome rearrangement. Although these methods share many commonly assigned orthologs, each method tends to produce an ortholog assignment significantly different from the others.
In this dissertation, we study the problem of assigning orthologous genes among closely related genomes on a genome scale. We first give a brief review of the existing methods for ortholog assignment in the literature, followed by a comprehensive comparison of each method. We then propose a new combinatorial approach for assigning ortholog pairs between a pair of closely related genomes by addressing the limitations of the existing methods. Our approach is based on the parsimony principle to transform one genome to another by minimizing the number of genome rearrangement events, including reversal, transposition, fusion, fission and gene duplications. By explicitly incorporating tandem gene duplication model and combining phylogenetic approaches, we develop an improved system MSOAR 2.0. Our experimental results on both simulated data and real data show that MSOAR 2.0 achieves the highest overall prediction accuracy among different programs in comparison.
Based on pairwise genome comparison results, we extend our ortholog assignment method to multiple genome comparison and develop a new system MultiMSOAR 2.0 to identify ortholog groups among multiple genomes. In MultiMSOAR 2.0, pairwise orthology information produced by MSOAR 2.0 is used to construct multipartite graphs for each gene family. In order to partition each gene family into a set of disjoint sets of orthologous genes, a multidimensional matching problem is formulated and a heuristic maximum weight matching algorithm is proposed. The partition results are then used to label the species tree. Considering some biological constraints, we formulate the tree labeling problem in the combinatorial optimization framework and develop two dynamic programming algorithms to solve the problem. Our experimental results show that MultiMSOAR 2.0 achieves much higher prediction accuracy than the existing ortholog assignment systems for multiple genomes. Moreover, MultiMSOAR 2.0 also provides information about gene births, duplications and losses in evolution, which may be of independent biological interest.