Limiting Biases in Biological Data Analysis by Pooling Information
- Author(s): Bhutani, Kunal
- Advisor(s): Schork, Nicholas J; Bafna, Vineet; et al.
Innovations in the design and implementation of high-throughput technologies have shifted biological research from hypothesis-driven inquiries to large data-driven studies. Scientists can now jointly interrogate the genome, transcriptome, metabolome, microbiome, and dozens of other molecular systems to develop more complete, interconnected pictures of biological states. However, accurate interpretation of each state requires a thorough understanding of the sources of variation associated with the underlying assays and experimental approaches used. Here, I pool information from related sources based on known generative processes to model variation and limit biases in the analysis of three different biological phenomena. First, I discuss jointly identifying genomic variants in induced pluripotent stem cells derived from the same fibroblast population to assess the mutational burden of three different reprogramming methods: retroviral transfection, Sendai virus, and non-integrating mRNA. The research suggests that each method induces new mutations, but there are no obvious systematic differences in the types of mutations or in the genomic regions harboring them. Shifting to transcriptomics, I next model uncertainty and variation in imputed expression in transcriptome-wide association studies. I show through simulations that a novel Bayesian method that pools multiple models of transcription regulation outperforms current methodologies in identifying associations between imputed gene expression and a phenotype. In an application to seven diseases from the Wellcome Trust Case Control Consortium, the method finds 42 associations, 17 of which have not previously been identified by GWAS or differential gene expression analyses in case-control cohorts. Finally, I describe results from a study exploring longitudinal profiles of the metabolome, microbiome, and transcriptome of a young female germline TP53 mutation carrier.
The motivation for this study was to determine whether any health status changes indicative of tumor formation might occur in this carrier, given her extremely high cancer susceptibility. I use a Bayesian model to separate metabolite variation from instrumentation variation by calculating latent metabolite levels across multiple instrumentation runs. Fortunately, I do not find obvious and statistically significant deviations from baseline for any biomarker indicative of cancer, but I highlight power limitations in such study designs. Together, these three works demonstrate the importance and utility of pooling information to limit biases in contemporary high-throughput, data-intensive biological analyses.
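The idea of separating metabolite variation from instrumentation variation can be illustrated with a small sketch. The abstract does not specify the Bayesian model, so the code below is not the method used in this work. It is a simplified least-squares stand-in that assumes each observed log-intensity is the sum of a latent metabolite level at a time point and an offset specific to the instrumentation run. The function name `estimate_latent_levels`, the additive form, and the sum-to-zero constraint on run offsets are all assumptions made for illustration.

```python
def estimate_latent_levels(obs, n_iter=200):
    """Estimate latent levels and run offsets under an assumed additive model.

    obs[r][t] is the log-intensity of one metabolite measured in run r at
    time point t, or None if that time point was not measured in that run.
    Assumed model (hypothetical, not from the source):
        obs[r][t] = level[t] + offset[r] + noise
    Every run and every time point must have at least one measurement.
    """
    n_runs, n_times = len(obs), len(obs[0])
    levels = [0.0] * n_times
    offsets = [0.0] * n_runs
    for _ in range(n_iter):
        # Update each latent level given the current run offsets.
        for t in range(n_times):
            vals = [obs[r][t] - offsets[r]
                    for r in range(n_runs) if obs[r][t] is not None]
            levels[t] = sum(vals) / len(vals)
        # Update each run offset given the current latent levels.
        for r in range(n_runs):
            vals = [obs[r][t] - levels[t]
                    for t in range(n_times) if obs[r][t] is not None]
            offsets[r] = sum(vals) / len(vals)
        # Center offsets at zero so levels and offsets are identifiable.
        shift = sum(offsets) / n_runs
        offsets = [o - shift for o in offsets]
        levels = [lv + shift for lv in levels]
    return levels, offsets


if __name__ == "__main__":
    # Made-up values for three runs and four time points.
    example = [
        [1.5, 2.4, 3.6, None],
        [None, 1.6, 2.5, 3.4],
        [1.0, None, 3.1, 4.0],
    ]
    levels, offsets = estimate_latent_levels(example)
    print("latent levels:", [round(v, 3) for v in levels])
    print("run offsets:  ", [round(v, 3) for v in offsets])
```

Because runs share time points, information is pooled across runs: a time point measured in only some runs is still placed on a common scale. A Bayesian treatment would additionally put priors on the levels and offsets and report posterior uncertainty, which this sketch omits.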