The proliferation and continual improvement of sequencing technologies over the last three decades have vastly increased the rate of biological data generation while reducing the cost per base sequenced. However, in many ways, statistical and computational strategies for extracting meaningful biological insights from this unprecedented wealth of data have not kept pace with these molecular advances. For instance, genomic structural variants (SVs) remain challenging to capture and evaluate with the most common types of sequencing, despite being implicated in complex traits, including adaptive phenotypes and disease predisposition. To address these unmet needs in a rapidly changing scientific landscape, we developed novel algorithms and applied cutting-edge computational and quantitative techniques to a diverse set of scientific questions.
First, we developed a method called FREQ-Seq², along with associated algorithms and software, to enable the rapid and targetable study of allele frequencies while addressing limitations of traditional techniques, such as throughput and scalability. We applied our approach to studying evolutionary genetics in E. coli, demonstrating its accuracy and precision in characterizing population dynamics and genotype distributions. We also contributed a valuable resource to the scientific community by presenting a new reference-quality genome assembly for the model organism Drosophila melanogaster based on long-read sequencing, along with a thorough investigation of the hidden genetic variation it revealed and the evolutionary insights that followed. For example, we discovered previously unidentified SVs potentially linked to complex traits, such as a duplication associated with nicotine resistance, and gained new insight into the cosmopolitan distribution of SVs in Drosophila. Additionally, we leveraged advances in machine learning to further explore SVs, training a deep convolutional neural network-based model to identify SVs with high precision using widely available short-read Drosophila datasets. Finally, we demonstrated the utility of applying computer vision and machine learning techniques to ongoing initiatives to digitize herbarium catalogs by developing algorithms and a pipeline to extract phenological and ecological data from digitized specimen images, facilitating rapid and accurate labeling and annotation and making these resources easier for end users to access. Taken as a whole, this thesis provides a comprehensive study of computational and quantitative approaches and their application to wide-ranging challenges in genetics, ecology, and evolutionary biology.