We currently live in a state of perpetual genomic epistemic uncertainty, broadly unaware of the distinct predispositions and liabilities conferred by our underlying genetic encoding, particularly for complex diseases. The immense dimensionality intrinsic to genetic data, coupled with the limited availability of large, comprehensive datasets, creates practical barriers to a complete understanding of genotype-to-phenotype relationships. Moreover, complex phenotypes present compounding challenges, given their polygenic nature and the contributions of environmental factors.
The advent of increasingly affordable genomic profiling technologies has led to a proliferation of genomic datasets, offering new avenues to explore the genetic basis of complex diseases. Genome-wide association studies (GWASs) and polygenic risk scores (PRSs) have been instrumental in identifying and quantifying variant contributions to disease susceptibility. Despite their utility, however, PRSs are typically formulated in an additive manner that assumes conditional independence among variants, neglecting potential epistatic interactions that may impact phenotypic outcomes. Additionally, since PRSs largely rely on single nucleotide polymorphism (SNP) level data, they may fail to capture the influence of complex variants and highly polymorphic regions not adequately represented at this resolution.
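To make the additive formulation concrete, the following is a minimal sketch of a PRS computed as a weighted sum of effect-allele dosages. The function name `additive_prs`, the weights, and the dosages are hypothetical and chosen purely for illustration; they are not drawn from any study or from the tools described in this work.

```python
from typing import Sequence


def additive_prs(dosages: Sequence[float], weights: Sequence[float]) -> float:
    """Additive PRS: sum over variants of (per-allele weight * effect-allele dosage)."""
    if len(dosages) != len(weights):
        raise ValueError("dosages and weights must have the same length")
    # Each variant contributes independently of the genotype at every other
    # variant, so interaction (epistatic) terms are absent by construction.
    return sum(w * d for w, d in zip(weights, dosages))


# Hypothetical example: three SNPs with made-up per-allele effect weights.
weights = [0.5, -0.25, 1.0]
dosages = [2, 1, 0]  # copies of the effect allele carried at each SNP
score = additive_prs(dosages, weights)
```

Because the score is a plain sum, the contribution of any one variant is the same regardless of which alleles are carried elsewhere, which is the assumption that epistatic effects would violate.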
My dissertation takes a multifaceted approach to addressing these limitations. First, to both enable the formation of large, high-quality composite genomic datasets from individual studies and improve cross-dataset analysis portability, I developed GRIEVOUS, a tool for genomic data harmonization that resolves artifactual noise caused by differences in variant indexing and allele assignments across datasets. Second, I introduce VADEr, a Transformer-based architecture that combines techniques from both natural language processing and computer vision to generate genomic risk scores and reveal genetic heterogeneity, demonstrated in a proof-of-concept study in prostate cancer. Finally, I investigate the influence of highly polymorphic regions on complex disease risk by assessing the role of specific HLA genes in melanoma susceptibility. Overall, this work contributes new tools, methods, and insights that advance the detection of genetic contributions to complex diseases.