Deep Characterization of the Contribution of Short Tandem Repeats Across Tissues
High-Throughput Sequencing (HTS) and Genome-Wide Association Studies (GWAS) have given us unprecedented insights into the influence of Single Nucleotide Variants (SNVs) and Copy Number Variants (CNVs) on different phenotypes, including gene expression, diseases, and complex traits. However, how other complex genetic variants, such as Short Tandem Repeats (STRs), affect gene expression remains largely unknown. Identifying and genotyping these variants from short DNA sequencing reads or low-coverage data presents difficult bioinformatics challenges. Additionally, traditional association tests must be modified to handle highly multi-allelic loci such as STRs. Several studies have examined the effect of STRs on gene expression genome-wide. However, these studies were restricted to a single cell type, such as whole blood or lymphoblastoid cell lines (LCLs), and had limited power to detect associations due to low-quality genotypes. As a result, they have offered limited biological insight, and their findings are difficult to interpret in other contexts.
In this dissertation, we address the importance of incorporating STRs in causal screening and large-scale medical genetics studies. We perform the first, and to date the largest, characterization of STRs that contribute to gene expression variation across multiple tissues. To ensure robust and reliable results, we leverage data from the GTEx project, which has collected high-coverage whole-genome sequencing and RNA-sequencing data across dozens of tissues for more than 600 individuals. Our work confirms a clear contribution of STRs to gene expression regulation, with 25,554 expression STRs (eSTRs) identified across 17 tissues. Of these, 14% are identified as high-confidence causal variants after fine-mapping against nearby SNPs. eSTRs are highly enriched at predicted promoter and enhancer regions and for motifs with high GC content. We identify a subset of eSTRs capable of forming G-quadruplexes (G4), a highly stable DNA secondary structure known to be involved in gene regulation. We show that long G4-forming STRs tend to increase expression of nearby genes, potentially by lowering the free energy of promoter regions and promoting RNA polymerase II stalling. Finally, we identify high-confidence eSTRs that likely underlie previously identified genetic associations with complex phenotypes, including schizophrenia and blood-related traits.