Computational and Machine Learning Methods for Understanding Gene Regulation and Variant Effects
eScholarship
Open Access Publications from the University of California

UC Berkeley

UC Berkeley Electronic Theses and Dissertations

Abstract

The field of genomics has advanced rapidly since the development of high-throughput sequencing technologies. While we have access to more data than ever before, the number of open questions has only increased. In this dissertation, I present novel machine learning techniques to draw insights from genomic data. First, I tackle the analysis of alternative splicing, a crucial but overlooked step in gene regulation, from short-read single-cell RNA-seq data. To account for the large scale and sparsity of such data, I develop scQuint, a suite of efficient probabilistic methods for dimensionality reduction and differential splicing. Next, I approach the problem of genome-wide variant effect prediction from a new direction: DNA language models. We first propose GPN, trained on unaligned genomes, and apply it to study genetic variants in Arabidopsis thaliana. GPN shows improved power for highlighting variants under negative selection as well as those affecting traits. Furthermore, I show that GPN learns important genomic features, such as gene annotations and transcription factor binding site motifs, without any supervision. We then present GPN-MSA, a DNA language model trained on whole-genome alignments of vertebrates, and showcase its excellent performance in predicting deleteriousness across the entire human genome. These contributions not only pave the way for enhanced genomic understanding but also propose a methodological shift in genome analysis.
