The field of genomics has been advancing at a fast pace ever since the development of high-throughput sequencing technologies. While we have access to more data than ever
before, the number of open questions has only increased. In this dissertation, I present novel
machine learning techniques to draw insights from genomic data. First, I tackle the analysis
of alternative splicing — a crucial but overlooked step in gene regulation — from short-read
single-cell RNA-seq data. To account for the large scale and sparsity of such data, I develop
scQuint, a suite of efficient probabilistic methods for dimensionality reduction and differential
splicing. Next, I approach the problem of genome-wide variant effect prediction from a new
direction: DNA language models. We first propose GPN, trained on unaligned genomes,
and apply it to study genetic variants in Arabidopsis thaliana. GPN shows improved
power for identifying variants under negative selection as well as those affecting traits.
Furthermore, I show that GPN learns important genomic features such as gene annotations
and transcription factor binding site motifs, without any supervision. We then present
GPN-MSA, a DNA language model trained on whole-genome alignments of vertebrates,
and showcase its excellent performance in predicting deleteriousness across the entire human
genome. These contributions not only pave the way for enhanced genomic understanding
but also suggest a methodological shift in genome analysis.