Biological sequence analyses with interpretable linear models
- Mok, Amanda
- Advisor(s): Lareau, Liana
Abstract
Biological data are highly complex, reflecting not only the complexity of biological processes but also the multidimensionality of biological measurements. Biological sequences are a major class of biological data, and are often used as both predictors and reporters of biological function. Understanding the sequence structure in biological data is therefore critical to probing the mechanisms by which biological sequences govern biological function. Computational models for this purpose should be not only highly accurate but also interpretable. In this work, we describe two approaches to understanding the sequence determinants of biological function using linear models. We first estimated sequence-based technical biases present in ribosome profiling data, and developed a computational bias correction method to mitigate the effects of these biases on ribosome footprint counts. The interpretability of the model parameters enabled the generation of bias correction factors that directly quantify sequence-dependent effects. We next explored guide design principles through a high-throughput screen of Cas13a guides in a trans-cleavage RNA detection assay, leading to the development of a bioinformatics pipeline for rational guide design. Together, these two studies highlight the utility of linear models for interpretable biological sequence analyses.