Thesis
Peer Reviewed

From proteins, to machines, to protons, to genes, and back again

Fraga, Keith Jeffrey
Advisor(s): Korf, Ian F

UC Davis Electronic Theses and Dissertations (2022)

The success of data standards and public databases in biology is the foundation for the current and continued success of machine learning in biology and medicine. This dissertation explores the interactions between biology, computers, and people in order to develop novel machine learning methods for modeling complex biological problems. Data are one of the main resources for machine learning, and Chapters 1, 2, and 3 are explicitly about data organization and quality assurance in protein Nuclear Magnetic Resonance (NMR) spectroscopy. Chapters 4 and 5 present new machine learning architectures for genomic site recognition and NMR chemical shift prediction. Chapter 1 investigates how protein NMR chemical shift data are deposited at the Biological Magnetic Resonance Bank (BMRB) in order to build simple table look-up models that estimate protein chemical shifts. We find that the BMRB has low sequence diversity and contains redundant data that were difficult to locate and filter out. Unless BMRB entries with the same sequence, and possibly the same chemical shifts, are filtered out, look-up models will appear more accurate than they are, because the same data contaminate both the training and testing sets. Chapter 2 examines approaches to curating a large protein sample production and NMR database in order to create an NMR time-domain dataset. Quality assurance tests on this NMR sample/FID database uncovered data collisions and redundancies among the database records, which motivated the development of new NMR database management tools. Chapter 3 presents SpecDB, a relational database schema for archiving protein NMR samples and their associated time-domain data. SpecDB is open source and available at https://github.rpi.edu/RPIBioinformatics/SpecDB.git. Chapter 4 explores how deep neural networks can recognize genomic splice acceptor and donor sites from sequence alone, achieving 97% accuracy for highly used splice donor sites. Chapter 4 also investigates neural networks for intron/exon sequence classification, which reach at most 77% accuracy. Chapter 5 applies marginalized graph kernels to the prediction of NMR chemical shifts for small organic molecules. Incorporating chemical descriptors into the graph kernels yields a mean absolute error of 3.501 ppm for carbon chemical shifts. Taken together, the five chapters explore data integrity, data organization, and techniques for learning from data, with applications to structural biology problems.
