Computational methods and analyses in comparative genomics and epigenomics
- Author(s): Peng, Qian
- et al.
As biological problems are becoming more complex and data growing at a rate much faster than that of computer hardware, new and faster algorithms are required. This dissertation investigates computational problems arising in two of the fields : comparative genomics and epigenomics, and employs a variety of computational techniques to address the problemsOne fundamental question in the studies of chromosome evolution is whether the rearrangement breakpoints are happening at random positions or along certain hotspots. We investigate the breakpoint reuse phenomenon, and show the analyses that support the more recently proposed fragile breakage model as opposed to the conventional random breakage models for chromosome evolution. The identification of syntenic regions between chromosomes forms the basis for studies of genome architectures, comparative genomics, and evolutionary genomics. The previous synteny block reconstruction algorithms could not be scaled to a large number of mammalian genomes being sequenced; neither did they address the issue of generating non-overlapping synteny blocks suitable for analyzing rearrangements and evolutionary history of large-scale duplications prevalent in plant genomes. We present a new unified synteny block generation algorithm based on A-Bruijn graph framework that overcomes these shortcomings. In the epigenome sequencing, a sample may contain a mixture of epigenomes and there is a need to resolve the distinct methylation patterns from the mixture. Many sequencing applications, such as haplotype inference for diploid or polyploid genomes, and metagenomic sequencing, share the similar objective : to infer a set of distinct assemblies from reads that are sequenced from a heterogeneous sample and subsequently aligned to a reference genome. We model the problem from both a combinatorial and a statistical angles. First, we describe a theoretical framework. A linear-time algorithm is then given to resolve a minimum number of assemblies that are consistent with all reads, substantially improving on previous algorithms. An efficient algorithm is also described to determine a set of assemblies that is consistent with a maximum subset of the reads, a previously untreated problem. We then prove that allowing nested reads or permitting mismatches between reads and their assemblies renders these problems NP-hard. Second, we describe a mixture model-based approach, and applied the model for the detection of allele-specific methylations