In recent years, advances in genotyping and sequencing technologies have enabled human geneticists to discover numerous genetic variants. Genetic variation between individuals ranges from Single Nucleotide Polymorphisms (SNPs) to differences in large segments of DNA, referred to as Structural Variations (SVs), which include insertions, deletions, and copy number variations (CNVs). Genetic variants play an important role in human diseases and traits.
I first propose an efficient genotyping method that accurately reports the genotypes of thousands of individuals over a high-density SNP map at low cost. This method combines pooled sequencing with imputation. I then develop a probabilistic model, CNVeM, to detect CNVs from High-Throughput Sequencing (HTS) data. Experiments show that CNVeM estimates the copy numbers and boundaries of copied regions more precisely than previous methods.
Genome-wide association studies (GWAS) have discovered numerous individual SNPs involved in genetic traits. However, complex traits are likely influenced by interactions among multiple SNPs. I propose a two-stage statistical model, TEPAA, that greatly reduces computational time while maintaining almost identical power to the brute-force approach, which considers all possible combinations of SNPs. In an experiment on the Northern Finland Birth Cohort data, TEPAA achieves a 63-fold speedup.
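To illustrate the general idea of a two-stage search, the following is a minimal sketch and not TEPAA itself. It assumes, purely for illustration, that the first stage screens each SNP by the absolute Pearson correlation between its genotype and the trait, and that the second stage tests only pairs of SNPs that both passed the screen, using the product of their genotypes as a simple interaction term. The function names, the cutoff parameter, and both statistics are hypothetical choices; the statistics and thresholds actually used by TEPAA may differ.

```python
import itertools
import math


def pearson_r(x, y):
    """Pearson correlation of two equal-length numeric lists (0.0 if either is constant)."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def two_stage_pairs(genotypes, phenotype, marginal_cutoff):
    """Hypothetical two-stage screen.

    genotypes: list of SNPs, each a list of 0/1/2 minor-allele counts per individual.
    phenotype: list of quantitative trait values, one per individual.
    marginal_cutoff: assumed stage-1 threshold on |correlation|.
    """
    # Stage 1: cheap single-SNP screen, linear in the number of SNPs.
    marginal = [abs(pearson_r(g, phenotype)) for g in genotypes]
    kept = [i for i, r in enumerate(marginal) if r >= marginal_cutoff]

    # Stage 2: pairwise test restricted to SNPs that survived stage 1.
    results = []
    for i, j in itertools.combinations(kept, 2):
        product = [a * b for a, b in zip(genotypes[i], genotypes[j])]
        results.append((i, j, abs(pearson_r(product, phenotype))))
    return kept, results


# Toy data: the trait equals the product of SNP 0 and SNP 1.
snps = [
    [0, 1, 2, 0, 1, 2],
    [0, 0, 1, 1, 2, 2],
    [1, 1, 1, 1, 1, 1],
    [0, 2, 0, 2, 0, 2],
]
trait = [0, 0, 2, 0, 2, 4]
kept, results = two_stage_pairs(snps, trait, marginal_cutoff=0.5)
```

In this toy example only one of the six possible pairs reaches the second stage, which is where the savings over a brute-force search come from. A screen of this simple form would miss a pair whose SNPs show no marginal signal at all, so the choice of first-stage threshold governs the trade-off between speed and power.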
Another drawback of GWAS is that rare causal variants are not identified. Rare causal variants are likely to have been introduced into a population recently and are therefore likely to lie in shared Identity-By-Descent (IBD) segments. I propose a new test statistic to detect IBD segments associated with quantitative traits. I connect the proposed statistic to linear models, so that assessing the significance of an association does not require permutations. In addition, the method can control for population structure by utilizing linear mixed models.
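As a rough illustration of why a linear-model formulation removes the need for permutations, the sketch below regresses a quantitative trait on an indicator of whether each individual carries a given shared IBD segment. This is an assumed, simplified stand-in and not the statistic proposed here. Under the usual linear-model assumptions, the resulting t statistic can be compared directly against a Student t distribution with n - 2 degrees of freedom. The function name and the carrier encoding are hypothetical, and the sketch does not model population structure.

```python
import math


def ols_t_statistic(carrier, trait):
    """Simple linear regression of trait on a 0/1 carrier indicator.

    Returns (slope, t_statistic, degrees_of_freedom).
    """
    n = len(trait)
    mx = sum(carrier) / n
    my = sum(trait) / n
    sxx = sum((x - mx) ** 2 for x in carrier)
    if sxx == 0:
        raise ValueError("carrier indicator is constant; slope is undefined")
    sxy = sum((x - mx) * (y - my) for x, y in zip(carrier, trait))

    slope = sxy / sxx
    intercept = my - slope * mx

    # Residual variance estimate with n - 2 degrees of freedom.
    rss = sum((y - intercept - slope * x) ** 2 for x, y in zip(carrier, trait))
    se = math.sqrt(rss / (n - 2) / sxx)
    return slope, slope / se, n - 2


# Toy data: three carriers of the segment and three non-carriers.
carrier = [1, 1, 1, 0, 0, 0]
trait = [2.0, 3.0, 4.0, 0.0, 1.0, 2.0]
slope, t_stat, dof = ols_t_statistic(carrier, trait)
```

Because the null distribution of the statistic is known analytically, significance follows from a single model fit per segment rather than from repeatedly shuffling the trait values.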