In this dissertation, we study several topics in genetics, including tag SNP selection, haplotype inference, error detection, and horizontal gene transfer detection. We formulate these problems as computational optimization problems, discuss the complexity, present our novel algorithms, and demonstrate the experimental results.
We first study the genome-wide tag SNP selection problem, propose a new model of multi-marker correlation for the problem, and present a greedy algorithm to select a smallest possible set of tag SNPs according to the model. Our experimental results on several real datasets from the HapMap project demonstrate that the new model yields more succinct tag SNP sets than the previous methods.
We then study how to infer haplotypes from genotype data which may contain genotyping errors, de novo mutations and missing alleles. We assume that there are no recombinants in the genotype data, which is usually true for tightly linked markers.
We prove the problem is NP-hard, and propose a heuristic algorithm, the core ofwhich is an integer linear program (ILP) using the system of linear equations over Galois field GF(2). Our experimental results show that the algorithm can infer haplotypes with a very high accuracy, and recover 65%-94% of genotyping errors depending on the pedigree topology.
We also study the detection of mutations, sequencing errors, and horizontal gene transfers in a set of closely related microbial genomes which do not align well because of rearrangements. We use a new SNP definition to handle the rearrangement problem, divide the problem into several optimization subproblems, and propose a series of algorithms to tackle each subproblem. Results from simulation experiments show that we can detect 31%-61% of horizontal gene transfer events depending on the mutation and missing rates, and the precision of our detection is about 48%-90%.