eScholarship
Open Access Publications from the University of California

Tools for Large-scale Genomic Analysis and Gene Expression Outlier Modeling for Precision Therapeutics

  • Author(s): Vivian, John
  • Advisor(s): Haussler, David; Paten, Benedict; et al.
Creative Commons Attribution-ShareAlike 4.0 International Public License
Abstract

In terms of data acquisition, storage, and distribution, genomics will soon become the largest “big data” domain in science. The field therefore needs appropriate tools to process the ever-increasing volume of genomic data so that researchers can leverage the power afforded by such enormous datasets. I present my work on Toil: portable, open-source workflow software that supports contemporary workflow definition languages and can run scientific workflows securely, reproducibly, and efficiently at large scale. Yet efficient computation is only one component of enabling scientific research, because data is not always accessible to the researchers who can use it. Data barriers, which arise from the need for patient privacy and from potential liability for data stewards, hinder scientific progress and stymie research collaboration by denying access to large amounts of biomedical information. As such, research institutions and consortia should prioritize making large datasets open-access so that research teams can develop novel therapeutics and gain valuable insight into a wide variety of diseases. One research group that benefits from both efficient large-scale computation and large open-access datasets is Treehouse, a pediatric cancer research group that investigates the role of RNA-seq in therapeutics. Treehouse uses RNA-seq to identify target drug candidates by comparing gene expression in individual patients to its own public compendium, which combines multiple open-access datasets with thousands of pediatric samples. However, Treehouse also needs methods to extract rare pediatric cancer data from information silos. I discuss a solution for extracting data from information silos: portable and reproducible software runs where the data resides and produces anonymized secondary output that can be sent back to the researcher for analysis. This computation-to-data method also addresses the logistical difficulty of securely sharing and storing large amounts of primary sequence data.
Finally, I propose a robust Bayesian statistical framework for detecting gene expression outliers in single samples. The framework leverages all available data to produce a consensus background distribution for each gene of interest, without requiring the researcher to manually select a comparison set, and provides posterior predictive p-values to quantify over- or under-expression.
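
To make the idea of a posterior predictive p-value concrete, the following is a minimal sketch and not the framework proposed in the thesis. It assumes a single background set of log-expression values for one gene, a Normal likelihood, and a noninformative prior proportional to 1/sigma^2; it does not reproduce the consensus background model. The function name `posterior_predictive_pvalue` and its parameters are hypothetical.

```python
import random
import statistics


def posterior_predictive_pvalue(background, observed, draws=20000, seed=0):
    """Estimate P(y_rep >= observed | background) by simulation.

    Illustrative assumptions (not taken from the thesis):
    Normal likelihood, prior p(mu, sigma^2) proportional to 1 / sigma^2.
    """
    n = len(background)
    if n < 2:
        raise ValueError("need at least two background values")
    mean = statistics.fmean(background)
    var = statistics.variance(background)  # sample variance (n - 1 denominator)
    if var == 0:
        raise ValueError("background values must not all be identical")

    rng = random.Random(seed)
    exceed = 0
    for _ in range(draws):
        # sigma^2 | data follows a scaled inverse chi-square(n - 1, var);
        # a chi-square(k) draw is Gamma(shape=k/2, scale=2).
        chi2 = rng.gammavariate((n - 1) / 2.0, 2.0)
        sigma2 = (n - 1) * var / chi2
        # mu | sigma^2, data ~ Normal(mean, sigma^2 / n)
        mu = rng.gauss(mean, (sigma2 / n) ** 0.5)
        # Replicated expression value drawn from the posterior predictive.
        y_rep = rng.gauss(mu, sigma2 ** 0.5)
        if y_rep >= observed:
            exceed += 1
    return exceed / draws
```

Under these assumptions, a value near 0 suggests the sample's expression is high relative to the background, a value near 1 suggests it is low, and a value near 0.5 suggests it is typical.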
