Algorithms for Human Genetics
Whereas Mendel used breeding experiments and painstakingly counted peas, modern biology increasingly requires computational tools. In the late 1800's probability and experimental genetics were the critical tools for discovering the gene. Today, the combined use of statistical and computational methods to make genetic and genomic discoveries has increased after the discovery of the DNA double-helix and the development of sequencing methods. By examining relationships among individuals using computational tools, geneticists have been able to understand the biological mechanisms that produce genetic diversity, map ancestral movements of populations, reconstruct ancestral genomes, and identify relatives. Furthermore, models in genetics have inspired advances in computer science, notably the model for inheritance in families is an early example of a graphical model and helped inspire the sum-product algorithm.
The genetic data of interest is single-nucleotide polymorphism (SNP) data, which are positions in the genome known to have nucleotide variation across the population. Humans are diploid individuals having two copies of each chromosome. Data for an individual can come in two forms, either haplotypes or genotypes. The haplotypes are two strings, each giving the sequence of nucleotides that appear together on the same chromosome. The genotypes, for each position in the genome, give an unordered set of nucleotides that appear. In particular the genotype is said to be `unphased' due to the lack of information about which nucleotide appears on which chromosome.
In human genetics there are two main ways to model relatedness: evolutionary relationships between people and closer, family relationships. Evolutionary relationships, from the domain of population genetics, occur through a distant relative and leave small traces of the relationship in the genome. Family relationships are typically much closer and leave much larger traces in the genome. This thesis examines algorithms for both types of relationships.
For evolutionarily related individuals, this thesis presents the perfect phylogeny and coalescent and then examines two related questions. The first is related to privacy of genetic data used for research purposes. In order to share data from studies while hopefully maintaining the privacy of study participants, geneticists have released the summary statistics of the data. A natural question, whether individuals can be detected in the summary data, is answered in the affirmative by using a perfect phylogeny model. The second question is how to construct perfect phylogenies from haplotypes where there is missing data. We introduce a polynomial-time algorithm for enumerating such phylogenies. This algorithm can be used to compute the probability of the data as an expectation over possible coalescent genealogies.
Recent relationships are modeled using a family tree, or pedigree graph. Traditionally, geneticists construct these graphs from genealogical records in a very tedious process of examining birth, death, and marriage records. Invariably mistakes are made due to poor record keeping or incorrect paternity information. As an alternative to manual methods, this thesis addresses the problem of automatically constructing pedigree graphs from genetic data.
The most obvious way to reconstruct pedigrees from genetic data is to use a structured machine learning approach, similar to phylogenetic reconstruction. That method would involve a search over the space of pedigree graphs where the objective is to find the pedigree graph with the highest likelihood of generating the observed data. Unfortunately, this is not a good way to proceed for two reasons: the space of pedigree graphs is exponential, and the likelihood calculation has exponential running time. The likelihood calculation given genotype data is known to be NP-hard. In an attempt to make use of the likelihood in complex pedigrees, the method PhyloPed uses a Gibbs sampler to infer haplotypes from genotype data. In a second attempt to use likelihood methods, this time for haplotype data, an NP-hardness result is presented. A third attempt to find an efficient algorithm for the likelihood problem results in a state-space reduction method for the pedigree hidden Markov model.
Since likelihood-based approaches seem completely infeasible, a completely different approach is introduced. We focus on the problem of inferring relationships between a set of living individuals with available identity-by-descent data. For convenience, we assume that the inferred pedigree is monogamous without inter-generational mating. Two heuristic and practical pedigree reconstruction methods are introduced, one for inbred pedigrees and the other for outbred pedigrees. This work immediately reveals another important problem, that of evaluating the resulting inferred pedigree against a ground-truth pedigree. This can be done either by determining whether the two pedigrees are isomorphic or by finding the edit distance between the two pedigrees.