eScholarship
Open Access Publications from the University of California

UC Santa Cruz

UC Santa Cruz Electronic Theses and Dissertations

Overcoming data privacy and data gravity challenges in bioinformatics research

Abstract

Since their inception, next-generation sequencing technologies have generated a massive amount of DNA, RNA, and protein sequences. However, data privacy policies often restrict the sharing of such data because of the risk of re-identifying the individuals from whom the sequences were generated. Even when all the data from a sequencing experiment are available, they are often insufficient to achieve statistical power or to train machine learning models. Ironically, despite this scarcity, some datasets are too large to share realistically with other researchers. In this thesis, I explore methods to overcome the challenges of data privacy and data gravity in bioinformatics research.

In collaboration with QIMR Berghofer and the RIKEN Center for Integrative Medical Sciences, we used federated methods to analyze genomic data from BioBank Japan in situ, classifying variants of uncertain significance while preserving privacy. With the Department of Laboratory Medicine and Pathology at the University of Washington, we developed a statistical model demonstrating that responsibly shared clinical evidence alone can classify variants of uncertain significance that occur at a rate of 1 in 100,000 people within just a few years. With researchers from McGill University, we reviewed the state of the art in federated computing technologies and how well they satisfy the privacy requirements of the General Data Protection Regulation. With researchers from NASA, Amazon, and Intel, we developed a federated learning framework that runs between terrestrial and space-borne compute infrastructure, laying the groundwork for subsequent experiments that eliminate the need to transfer large datasets across astronomical distances. Finally, at NASA, we used a causal inference machine learning ensemble to infer a robust correlation between mouse liver gene expression and a corresponding lipid density phenotype in space-flown mice.
