Genome-wide association studies (GWAS) have identified numerous genetic variants linked to a wide range of biological traits. Notably, 95% of the most likely causal variants lie in noncoding regions of the genome. A promising way to explain the effects of these noncoding variants is to understand how they influence phenotypes by altering gene expression. This approach, known as a transcriptome-wide association study (TWAS), uses genetic models of gene expression to identify genes associated with different traits. Building predictive models of gene expression from genetic variants is a crucial first step in this process. PrediXcan, a regularized linear model that sums the additive effects of genetic variants, is a popular, though older, method for building such models.
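To make the idea of a regularized additive model concrete, the following is a minimal sketch of an elastic-net fit by coordinate descent, with predicted expression computed as an intercept plus a weighted sum of variant dosages. It is illustrative only: the data are synthetic, the function names (`fit_elastic_net`, `predict`) and the penalty settings (`lam`, `alpha`) are assumptions made for this example, and it does not reproduce the PrediXcan software or its training pipeline.

```python
import random


def soft_threshold(z, t):
    """Shrink z toward zero by t (the lasso proximal step)."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0


def fit_elastic_net(X, y, lam=0.05, alpha=0.5, n_sweeps=200):
    """Coordinate descent on centered data, minimizing
    (1/2n)*||y - Xw||^2 + lam*(alpha*||w||_1 + (1 - alpha)/2*||w||_2^2)."""
    n, p = len(X), len(X[0])
    x_mean = [sum(row[j] for row in X) / n for j in range(p)]
    y_mean = sum(y) / n
    Xc = [[row[j] - x_mean[j] for j in range(p)] for row in X]
    resid = [yi - y_mean for yi in y]  # residual while all weights are zero
    col_var = [sum(Xc[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    w = [0.0] * p
    for _ in range(n_sweeps):
        for j in range(p):
            if col_var[j] == 0.0:
                continue  # a variant with no variation carries no signal
            # Covariance of variant j with the residual, adding back its own term
            rho = sum(Xc[i][j] * resid[i] for i in range(n)) / n + col_var[j] * w[j]
            new_w = soft_threshold(rho, lam * alpha) / (col_var[j] + lam * (1 - alpha))
            delta = new_w - w[j]
            if delta != 0.0:
                for i in range(n):
                    resid[i] -= delta * Xc[i][j]
                w[j] = new_w
    intercept = y_mean - sum(w[j] * x_mean[j] for j in range(p))
    return intercept, w


def predict(intercept, w, dosages):
    """Predicted expression: intercept plus the sum of additive variant effects."""
    return intercept + sum(wj * xj for wj, xj in zip(w, dosages))


# Synthetic example (hypothetical): 200 people, 5 variants coded as 0/1/2 dosages,
# with only variants 0 and 2 truly affecting expression.
random.seed(0)
n_people, n_snps = 200, 5
X = [[random.choice([0, 1, 2]) for _ in range(n_snps)] for _ in range(n_people)]
y = [1.0 + 0.8 * row[0] - 0.5 * row[2] + random.gauss(0, 0.1) for row in X]

intercept, weights = fit_elastic_net(X, y)
predicted = [predict(intercept, weights, row) for row in X]
```

In this toy setting the penalty shrinks the weights of uninformative variants toward zero while retaining the two causal ones, which is the behavior that makes regularized models practical when many nearby variants are candidates.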
In this thesis, we explore applications of, and potential improvements to, these gene expression models.
In the first chapter, we extend PrediXcan by incorporating age as an additional predictor, allowing us to quantify the gene expression variance explained by both genetics and aging across 27 tissues from 948 humans. We find that expression heritability (h^2) is consistent across tissues, while the contribution of aging varies significantly, with R^2(age) > h^2 in five tissues. Our analysis reveals tissue-specific evolutionary trends, and aging-associated genes in proliferative tissues show patterns that align with high cancer rates and age-associated somatic mutations. Additionally, we find that genes highly influenced by genetics tend to be less constrained and less functionally important, whereas genes with age-associated expression are more constrained and more functionally important.
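As a rough illustration of comparing the variance explained by genetics with that explained by age, the sketch below computes a squared correlation for each predictor on synthetic data for a single gene. Everything here is hypothetical: a single precomputed `genetic_score` stands in for a multi-variant model, squared correlation is only a crude proxy for h^2 and R^2(age), and the effect sizes are chosen so that age dominates. The estimation procedure used in the thesis may differ.

```python
import random


def squared_correlation(x, y):
    """Squared Pearson correlation: share of variance in y explained by x alone."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy * sxy / (sxx * syy)


# Synthetic data for one gene in one tissue (hypothetical values throughout).
random.seed(1)
n_people = 500
genetic_score = [random.gauss(0, 1) for _ in range(n_people)]  # stand-in genetic predictor
age = [random.uniform(20, 70) for _ in range(n_people)]        # age in years
expression = [
    0.3 * g + 0.05 * a + random.gauss(0, 1)
    for g, a in zip(genetic_score, age)
]

r2_genetic = squared_correlation(genetic_score, expression)
r2_age = squared_correlation(age, expression)

# True when age explains more expression variance than the genetic predictor.
age_dominates = r2_age > r2_genetic
```

Because the two simulated predictors are independent, their squared correlations can be compared directly; with correlated predictors a joint model would be needed to separate their contributions.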
In the second chapter, we evaluate an alternative approach, genomic deep learning models, for predicting gene expression from genetics. While these models perform well in predicting expression variation across genes, they fall short in predicting expression variation across individuals and do not match the performance of the simpler PrediXcan model. We also investigate whether combining deep learning predictions with the regularized linear regression framework of PrediXcan can improve prediction accuracy by leveraging the strengths of both approaches.
The final chapter focuses on association studies within a new, diverse cohort of people living with HIV (PLWH). We hypothesize a link between increased levels of inflammation biomarkers activated by the NLRP3 inflammasome pathway (IL-1β, IL-6, and IL-18) and cardiovascular disease (CVD) risk within PLWH. Using whole genome sequencing data from 1001 PLWH, we identify significant GWAS signals implicating genes involved in immune function, cardiovascular function, and response to HIV. We find three GTEx eQTLs matching the GWAS signal regions, suggesting a mechanism for increased CVD risk through inflammation in coronary artery tissue. Using PrediXcan, we conduct a baseline TWAS, providing a testbed for future model improvements, particularly in enhancing generalizability to different genetic populations, as observed in the second chapter. Together, these studies highlight the utility of gene expression models in disentangling the roles of genetics, aging, and inflammation in disease risk, and demonstrate the potential for integrating advanced modeling techniques to improve predictive accuracy and generalizability across diverse populations.