eScholarship
Open Access Publications from the University of California

UC Berkeley

UC Berkeley Electronic Theses and Dissertations

Methodology development in medical and genomics data

Abstract

Data that quantify various aspects of medicine and biology are constantly improving and changing. As new techniques become available, opportunities arise to improve our understanding of diseases and other biological properties. These ever-changing data modalities require new statistical techniques for proper analysis. In this thesis we analyze three types of biological data, developing new methods when necessary: bulk RNA-sequencing, single-cell RNA-sequencing, and magnetic resonance imaging (MRI).

First, we develop a statistic for analyzing high-throughput RNA-sequencing data in the context of drug discovery. Changes in gene expression between diseased and normal tissues can be used to understand the genomic signature of a disease. When diseased cells are exposed to a large number of drugs and other perturbations, we can systematically search for the perturbations that reverse the expression of the genes altered in the disease. In Chapter 2, we present a new method for quantifying this relationship between disease and drug that outperforms existing methods in simulation, decreases computation time, and performs comparably on real data. Additionally, we show an improved ability to quantify the involvement of individual genes in effective drugs. An accompanying software package makes this method easily applicable to new data.
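To make the idea of searching for perturbations that reverse a disease signature concrete, here is a minimal sketch. It is not the statistic developed in Chapter 2, which the abstract does not specify. It assumes that each signature is a mapping from gene name to log fold change and uses a generic stand-in score, the negative Pearson correlation, so that higher values suggest stronger reversal. The function names, gene names, drug names, and values are all hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length numeric lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant signature")
    return cov / sqrt(var_x * var_y)


def reversal_score(disease_lfc, drug_lfc):
    """Illustrative score (an assumption, not the thesis statistic):
    negative correlation over the genes present in both signatures."""
    genes = sorted(set(disease_lfc) & set(drug_lfc))
    disease_values = [disease_lfc[g] for g in genes]
    drug_values = [drug_lfc[g] for g in genes]
    return -pearson(disease_values, drug_values)


def rank_drugs(disease_lfc, drug_profiles):
    """Return (drug, score) pairs sorted from strongest to weakest reversal."""
    scores = {
        name: reversal_score(disease_lfc, profile)
        for name, profile in drug_profiles.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical log fold changes (disease vs. normal, treated vs. untreated).
disease = {"GENE_A": 2.0, "GENE_B": -1.5, "GENE_C": 0.5, "GENE_D": -0.5}
drugs = {
    "drug_reverser": {"GENE_A": -1.8, "GENE_B": 1.2, "GENE_C": -0.4, "GENE_D": 0.6},
    "drug_mimic": {"GENE_A": 1.0, "GENE_B": -0.75, "GENE_C": 0.25, "GENE_D": -0.25},
}
ranking = rank_drugs(disease, drugs)
```

In this toy example the drug whose profile opposes the disease signature ranks first, while the drug that mimics the disease receives a score near -1. A correlation-based score treats every gene equally, which is one reason more refined statistics are of interest.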

In the second study, we extend an existing semi-supervised learning method called GeneFishing for application to single-cell RNA-sequencing (scRNA-seq) data. Single-cell data have unique properties that make their analysis different from that of the bulk data used in Chapter 2. Direct application of GeneFishing to scRNA-seq data was not always possible because of dissimilar sources of variation and a marked increase in technical noise; the modifications we provide make the analysis possible. In addition to using GeneFishing as an effective gene prioritization method in single-cell data, we show that it can help in understanding how to measure gene-gene co-expression and can serve as a context-specific feature selection method for downstream analyses. Again, we present an accompanying software package, scGeneFishing, for performing GeneFishing on a variety of datasets, including scRNA-seq.

Finally, in Chapter 4 of this dissertation, we analyze a different type of medical data: brain images. We use MRI scans of patients with primary progressive multiple sclerosis to study the association of regional brain volume at baseline with disease progression. We find that statistics summarizing volume in pre-defined regions of interest (ROIs) are not more predictive of progression than traditional clinical and MRI variables. However, by deriving data-driven ROIs through voxel-level clustering, we are able to achieve better predictive performance.
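As a rough illustration of what deriving data-driven ROIs through voxel-level clustering can look like, the sketch below groups voxels whose values vary similarly across subjects and then summarizes each group per subject. The abstract does not name the clustering algorithm used in Chapter 4, so plain k-means with a deterministic initialization is an assumption here, as are the function names and the toy numbers.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans_labels(points, k, n_iter=20):
    """Plain k-means (an assumed stand-in for the thesis method).
    Centroids start at the first k points, so results are deterministic."""
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assign each voxel to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Move each centroid to the mean of its assigned voxels.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def roi_means(voxel_profiles, labels, k):
    """Per-subject mean value within each data-driven ROI (cluster)."""
    n_subjects = len(voxel_profiles[0])
    summaries = []
    for c in range(k):
        members = [v for v, lab in zip(voxel_profiles, labels) if lab == c]
        if not members:
            summaries.append(None)  # empty cluster: no summary available
            continue
        summaries.append(
            [sum(v[s] for v in members) / len(members) for s in range(n_subjects)]
        )
    return summaries


# Hypothetical data: 4 voxels, each with a value for 3 subjects.
voxel_profiles = [
    [1.0, 2.0, 3.0],
    [5.0, 5.0, 1.0],
    [1.1, 2.1, 2.9],
    [5.2, 4.8, 1.2],
]
labels = kmeans_labels(voxel_profiles, k=2)
rois = roi_means(voxel_profiles, labels, k=2)
```

The per-subject ROI summaries could then be used as predictors of progression alongside clinical variables. The number of clusters and the voxel features are modelling choices that this sketch does not address.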
