Efficient Algorithms for Human Genetic Variation Detection using High-throughput Sequencing Techniques
- Author(s): He, Dan
- Advisor(s): Eskin, Eleazar
- et al.
High-throughput sequencing (HTS) technologies are a class of genome sequencing techniques in which short DNA segments, called reads, are sequenced (sampled) from a genome. Compared with traditional genome sequencing techniques, they cost less and parallelize the sequencing process to produce millions of reads. These technologies have been widely applied to many important problems involving human genetic variation. We target three such problems using the reads generated by HTS.
It is well known that human individuals differ from each other by about 0.1% of their genome. The majority of these differences are in the form of single nucleotide polymorphisms (SNPs). Haplotypes, defined as the sequences of SNP alleles on each chromosome of a human genome, are important for problems such as imputing genetic variants and inferring the relatedness of human individuals. A difficulty in haplotype inference is the presence of sequencing errors, and a natural formulation of the problem is to infer the haplotypes that are most consistent with the reads from a combinatorial perspective. Unfortunately, this formulation of haplotype assembly is known to be NP-hard. We proposed several techniques, including dynamic programming, MaxSAT, and a hidden Markov model (HMM), to solve the problem optimally from different perspectives.
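To make the combinatorial formulation concrete, the following is a minimal sketch of one common way to state it: find the pair of complementary haplotypes that minimizes the number of read alleles that would have to be corrected. It is an exhaustive search for illustration only, not the dynamic programming, MaxSAT, or HMM methods of the thesis. The function names, the read encoding (a dict from SNP index to allele 0/1), and the assumption that every SNP is heterozygous are all hypothetical choices made for this example.

```python
from itertools import product


def correction_cost(haplotype, reads):
    """Count the allele corrections needed so every read matches the
    haplotype or its complement (all SNPs assumed heterozygous)."""
    complement = [1 - allele for allele in haplotype]
    total = 0
    for read in reads:
        # read: dict mapping SNP index -> observed allele (0 or 1)
        mismatches_h = sum(1 for i, a in read.items() if haplotype[i] != a)
        mismatches_c = sum(1 for i, a in read.items() if complement[i] != a)
        # Assign the read to whichever haplotype it fits better.
        total += min(mismatches_h, mismatches_c)
    return total


def assemble_haplotypes(reads, num_snps):
    """Exhaustive search over haplotypes; exponential in num_snps,
    so only usable on toy inputs."""
    best, best_cost = None, None
    for bits in product((0, 1), repeat=num_snps - 1):
        # Fix the first allele to 0: a haplotype and its complement
        # describe the same solution.
        candidate = (0,) + bits
        cost = correction_cost(candidate, reads)
        if best_cost is None or cost < best_cost:
            best, best_cost = candidate, cost
    return best, tuple(1 - a for a in best), best_cost


# Toy input: four SNPs, five reads, the last read carrying one error.
reads = [
    {0: 0, 1: 0, 2: 1},
    {1: 0, 2: 1, 3: 1},
    {0: 1, 1: 1},
    {2: 0, 3: 0},
    {0: 0, 1: 1},
]
h1, h2, cost = assemble_haplotypes(reads, 4)
print(h1, h2, cost)  # (0, 0, 1, 1) (1, 1, 0, 0) 1
```

The search space doubles with every additional SNP, which is why exact methods need more structure than plain enumeration.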
Structural variations, and in particular copy number variations (CNVs), have dramatic effects on disease and traits. We first proposed an efficient algorithm to detect and reconstruct CNVs in unique genomic regions, where the sequencing reads generated by HTS are mapped to a reference genome and signatures indicating the presence of a CNV are identified. We then extended the algorithm to a much more challenging problem, in which CNVs lie in repeat-rich regions and reads may map to multiple positions. To our knowledge, our method is the first attempt to both identify and reconstruct CNVs in repeat-rich regions.
Recent advances in sequencing technologies set the stage for large population-based studies, in which the DNA or RNA of thousands of individuals will be sequenced. A few multiplexing schemes have been suggested, in which a small number of DNA pools are sequenced and the results are then deconvoluted using compressed sensing or similar approaches. These methods, however, are limited to the detection of rare variants. We provide a new algorithm for the deconvolution of DNA pool multiplexing schemes. The presented algorithm utilizes a likelihood model and linear programming and is able to genotype both low and high allele frequency SNPs with microarray genotyping and imputation.
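As a rough illustration of what deconvolution of a pooling scheme means, the sketch below models each pool's expected alternate-allele frequency from the genotypes of its members and then recovers the genotypes by exhaustive least-squares search. This deliberately swaps in brute force for the likelihood model and linear programming described above, and it ignores microarray data and imputation. The design matrix, function names, and genotype encoding (0, 1, or 2 alternate alleles per individual) are assumptions made for this example.

```python
from itertools import product


def pool_frequencies(design, genotypes):
    """Expected alternate-allele frequency in each pool.

    design[p][j] is 1 if individual j is in pool p, else 0.
    genotypes[j] is the alternate-allele count (0, 1, or 2)."""
    freqs = []
    for pool in design:
        members = [j for j, included in enumerate(pool) if included]
        alt_alleles = sum(genotypes[j] for j in members)
        # Each diploid member contributes two chromosomes to the pool.
        freqs.append(alt_alleles / (2 * len(members)))
    return freqs


def deconvolve(design, observed):
    """Pick the genotype vector whose predicted pool frequencies are
    closest (squared error) to the observed ones. Exhaustive search,
    so only feasible for a handful of individuals."""
    num_individuals = len(design[0])
    best, best_err = None, None
    for candidate in product((0, 1, 2), repeat=num_individuals):
        predicted = pool_frequencies(design, candidate)
        err = sum((p - o) ** 2 for p, o in zip(predicted, observed))
        if best_err is None or err < best_err:
            best, best_err = candidate, err
    return best


# Hypothetical design: 4 individuals spread over 4 pools of size 2.
design = [
    [1, 1, 0, 0],
    [0, 1, 1, 0],
    [0, 0, 1, 1],
    [1, 0, 1, 0],
]
true_genotypes = (0, 1, 0, 2)
exact = pool_frequencies(design, true_genotypes)
noisy = [0.27, 0.22, 0.52, 0.03]  # made-up noisy measurements
recovered_exact = deconvolve(design, exact)
recovered_noisy = deconvolve(design, noisy)
print(exact, recovered_exact, recovered_noisy)
```

In this toy design the pool memberships determine each genotype uniquely, so small measurement noise does not change the recovered vector; real schemes use far fewer pools than individuals, which is where likelihood models and optimization become necessary.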