Computational methods for analyzing and detecting genomic structural variation: applications to cancer
Understanding genetic variation has emerged as a key research problem of the post-genomic era. Until recently, the study of large genomic events, or structural variants, was marginal in comparison to that of smaller events, such as single nucleotide variants/polymorphisms. Technological advancements in sequencing, array design, and primer-based assays have made the detection of structural variants more cost-effective, reopening the possibility of high-throughput, systematic analysis. Here, we propose algorithms for utilizing, detecting, and analyzing these events. Cancer is a largely genomic disease driven by somatic mutation and often characterized by large-scale genome rearrangements. We develop optimization schemes for PCR-based diagnostics for detecting genomic lesions in cancer patients. The optimization allows robust detection of highly variable genomic lesions, even in a high background of normal DNA. We propose a subtle change to experimental design that significantly improves the assay without impacting experimental complexity. In a separate study, we present an efficient approach for de novo detection of gene fusion events given paired-end sequencing data. Even at low genomic coverage (~0.6X) with large insert (clone) sizes (>100 kb), our method reliably predicts gene fusions. Paired reads are further applied in reconstructing cancer genome architectures; we focus on local optimizations at complexly amplified or rearranged breakpoints. Large-scale genomic events also play important roles within normal populations and across species. We develop a novel approach that exploits unusual linkage disequilibrium patterns to detect inversion polymorphisms from limited SNP data. For phylogenetic inference, we track the insertion of transposable repeat elements across 28 mammalian species. Our algorithm returns phylogenies highly consistent with other studies and, in some cases, helps resolve points of debate.
Lastly, we present a framework for the design of high-throughput sequencing studies directed at transcriptome sequencing, haplotype assembly, and the detection of structural variants. An explicit trade-off is shown between detection and localization of breakpoints for different insert sizes when using paired reads. We prove that a mix of exactly two insert sizes provides the optimal probability of resolving a breakpoint to a given resolution. In transcriptome sequencing, we show that it is possible to accurately approximate a sample's underlying gene expression distribution with only 100K reads via a novel correction method.
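As a rough illustration of the detection side of the insert-size trade-off, the sketch below uses a simplified Poisson (Lander-Waterman style) model, which is an assumption made here and not the framework developed in this work. It assumes clones of a single fixed insert size placed uniformly at random on the genome, so the number of clones spanning a fixed breakpoint is approximately Poisson with mean equal to the clone coverage. The function names and the example parameters are hypothetical.

```python
import math


def expected_spanning_clones(num_pairs, insert_size, genome_length):
    """Mean number of clones spanning a fixed breakpoint (the clone coverage).

    Assumes uniformly placed clones of one fixed insert size (a simplification).
    """
    return num_pairs * insert_size / genome_length


def detection_probability(num_pairs, insert_size, genome_length):
    """Probability that at least one clone spans the breakpoint (Poisson approximation)."""
    coverage = expected_spanning_clones(num_pairs, insert_size, genome_length)
    return 1.0 - math.exp(-coverage)


if __name__ == "__main__":
    genome_length = 3e9    # hypothetical genome length in bp
    num_pairs = 1_000_000  # hypothetical number of read pairs
    for insert_size in (2_000, 40_000, 150_000):
        p = detection_probability(num_pairs, insert_size, genome_length)
        # A single spanning clone places the breakpoint only somewhere within
        # the insert, so insert_size is a coarse bound on localization.
        print(f"insert={insert_size:>7} bp  P(detect)={p:.3f}  "
              f"single-clone window <= {insert_size} bp")
```

Under these assumptions, a larger insert size raises the chance that a breakpoint is spanned for a fixed number of read pairs, while a single spanning clone localizes the breakpoint only to within roughly the insert size, which is the tension described above.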