Understanding disease through data-driven biology
Over the past five years, rapid technological advances have produced a surge in data generation, enabling unbiased genetic and epigenetic profiling of healthy and diseased individuals. Given these unprecedented advances, a key question is what we can learn from such rich datasets that we could not have learned by other means. Here, I take a data-first approach to three problems in which large data-collection efforts combined with novel analytic methods can shed new light on the biology of disease. In the first part, I use the comprehensive multi-omics profiling provided by The Cancer Genome Atlas (TCGA) to analyze the molecular and clinical features of head and neck squamous cell carcinoma (HNSCC) that govern patient survival. I find that among HNSCC tumors, TP53 mutation is frequently accompanied by loss of chromosome 3p, and that the combination of these events is associated with a surprising decrease in survival time. Continuing with TCGA samples, I then analyze a large pan-cancer set of patients for whom both tumor and adjacent normal tissue samples were profiled. By observing transcriptomic and epigenetic changes shared across a large and diverse set of tumors, this analysis identifies signals that are likely to be important for both the onset and progression of cancer. Finally, I use genome-wide epigenetic profiles to develop and validate epigenetic models of human aging in whole blood and purified blood cells, and use them to quantify the impact of HIV infection on aging. This work finds that both chronic and recent HIV infection lead to an average aging advancement of 4.9 years, increasing expected mortality risk by 19%. Taken together, these studies report new biological findings while providing examples of the power of data-driven analysis to aid in the understanding of biology and disease.