Learning Quantitative Sequence-Function Relationships using Massively Parallel Reporter Assays
- Author(s): Insigne, Kimberly Danielle
- Advisor(s): Kosuri, Sriram
- et al.
The field of genomics has grown rapidly over the past decade due to the advent of high-throughput sequencing technologies. Genomics relies on this wealth of information to draw biological inferences, but using inference to establish causality can be challenging, as many genetic factors correlate with one another. Due to the declining cost of both reading and writing DNA, new techniques known as massively parallel reporter assays (MPRAs) make it possible to test the function of a large library of tens to hundreds of thousands of designed DNA sequences in a single experiment. Testing designed libraries allows us to explore beyond natural sequence variation and directly test thousands of sequence-function hypotheses simultaneously. In this dissertation I discuss two projects that explore sequence-function relationships in different biological systems.
The first project focuses on how human genetic variation affects exon recognition, as mis-splicing is a major mechanism through which variants exert their influence. We developed a Multiplexed Functional Assay of Splicing using Sort-seq (MFASS) and assayed 27,333 variants from the Exome Aggregation Consortium located within or adjacent to 2,198 human exons. We found that 3.8% (1,050) of these variants led to large splicing disruptions. Many of these variants are extremely rare, lie outside canonical splice sites, are distributed evenly across intronic and exonic regions, and are difficult to predict. MFASS enables direct functional measurement of large-effect splicing defects at scale.
The second project focuses on promoters and transcriptional regulation in Escherichia coli. Promoter sequence space in bacteria is vast and difficult to study genome-wide due to external factors that influence transcription. We developed a genomically encoded MPRA to characterize the global promoter landscape and dissect active promoters for regulatory motifs. We measured the promoter activity of over 300,000 sequences spanning the entire genome and identified 3,321 active promoter regions in glucose minimal media and 3,477 in rich LB media. Furthermore, we performed a scanning mutagenesis of 2,057 E. coli promoters to identify regulatory sequences. Lastly, we implemented a variety of machine learning models to classify promoters and quantitatively predict their activity. Together, these approaches allow rapid characterization of promoter sequences within the E. coli genome.