Much of the effort in human genetics is directed toward identifying and characterizing genetic variants that affect human traits by examining relationships between traits and variants. A genome-wide association study (GWAS) quantifies the statistical association between genetic variation and phenotypes. These associations can shed light on the biological mechanisms affecting a phenotype and can allow the phenotype to be predicted from genetic information in a clinical setting. However, the majority of GWAS datasets have been generated with commodity single-nucleotide polymorphism (SNP) genotyping arrays, and the SNPs they capture fail to explain the majority of heritability for many complex traits, even with large sample sizes.
One compelling hypothesis for this missing heritability is that complex variants, such as multi-allelic repeats not in strong linkage disequilibrium with common SNPs, are important drivers of complex traits but are largely invisible to current analyses. Short tandem repeats (STRs), which consist of motifs of 1–6 bp repeated in tandem, comprise more than 3% of the human genome. Multiple lines of evidence support a role for STRs in complex traits, particularly in neurological and psychiatric phenotypes. However, existing technologies have not allowed for systematic STR association studies.
To overcome these challenges, we recently generated a reference STR+SNP haplotype panel that enables imputation of STR genotypes into the SNP genotype data available for most GWAS cohorts. Our imputation pipeline achieves high concordance and can be used to impute nearly 500,000 STRs genome-wide. We then leveraged this reference haplotype panel to impute STRs into GWAS data for more than 50,000 samples from the Psychiatric Genomics Consortium (PGC) and performed a genome-wide analysis of associations between STR lengths and schizophrenia.
In this dissertation, I demonstrate an end-to-end pipeline for conducting large, biobank-scale GWAS using STRs. As one of the first studies of its kind, it offers researchers a starting point for incorporating complex variants into their analyses.