Humans have long had working knowledge of what proteins do, well before protein sequences and structures were discovered. This early understanding is evident in applications such as food production, brewing, and medicine. However, it was not until the 20th century that biochemistry advanced significantly, driven by a deeper understanding of protein sequences and structures.
A widely accepted notion in the field is that a protein’s structure dictates its function. The precise spatial arrangement of amino acids, determined by the protein sequence, controls the biochemical properties of the protein. The study of protein function through sequence, structure, and biomolecular chemistry is known as protein function prediction; the inverse problem, in which a sequence is optimized for a target function and/or structure, is known as protein design.
The history of protein evolution is encoded in sequences, and this information can be extracted through conservation analysis. Position-specific scoring matrices (PSSMs) can be used as a proxy for evolutionarily critical functions, while sequence logos are commonly used to identify key functional residues. Overlaying consensus profiles onto structures gives a more comprehensive understanding of the relationship between sequence and structure. At the atomic level, domain experts identify functional residues, propose mechanisms for function, and verify hypotheses experimentally based on their biochemical knowledge. All of these efforts are guided, at least in part, by human experts.
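As an illustration of the conservation analysis described above, the following is a minimal sketch of how a PSSM could be built from a multiple sequence alignment. The toy alignment, the uniform background frequencies, the pseudocount of 1, and the function names `build_pssm` and `consensus` are all assumptions made for this example; they are not taken from the work described here, and production tools typically use sequence weighting and empirical background frequencies.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(alignment, pseudocount=1.0, background=None):
    """Return one dict per alignment column mapping residue -> log2-odds score.

    Assumptions for this sketch: uniform background frequencies unless given,
    additive pseudocounts, and gap characters simply ignored.
    """
    if background is None:
        background = {aa: 1.0 / len(AMINO_ACIDS) for aa in AMINO_ACIDS}
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")

    pssm = []
    for col in range(length):
        # Count only standard residues; gaps and unknowns are skipped.
        counts = Counter(seq[col] for seq in alignment if seq[col] in background)
        total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
        column = {}
        for aa in AMINO_ACIDS:
            freq = (counts[aa] + pseudocount) / total
            column[aa] = math.log2(freq / background[aa])
        pssm.append(column)
    return pssm


def consensus(pssm):
    """Pick the highest-scoring residue in each column."""
    return "".join(max(column, key=column.get) for column in pssm)


# Hypothetical toy alignment, for illustration only.
toy_alignment = ["MKTAY", "MKSAY", "MRTAY", "MKTGY"]
toy_pssm = build_pssm(toy_alignment)
toy_consensus = consensus(toy_pssm)
```

Positive scores mark residues that are enriched at a position relative to the background, which is the sense in which a PSSM can act as a proxy for evolutionary preference; fully conserved columns receive the highest scores.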
To automatically sample sequence and structure space in molecular modeling (MM), David Baker and his team introduced Rosetta. Rosetta was first developed for de novo structure prediction but has since expanded into homology modeling, protein design, and docking. Like other molecular modeling techniques, Rosetta describes biomolecular interactions through a score function. This score function captures the physical interactions between atoms and their statistics, and can be used to design for biological functions such as stability, enzymatic activity, protein-ligand affinity, and protein-protein interaction. To improve predictive resolution further, machine learning (ML) models can be trained on energetic terms and other features and then deployed to model the functional landscape.
In contrast, end-to-end models learn directly from sequence and/or structure without relying on external features. Protein language models (pLMs) identify patterns in protein sequences as universal sequence-to-sequence approximators, and have been shown to be a higher-order generalization of site-specific and pairwise conservation. Like a PSSM score, the likelihood that a pLM recovers a masked amino acid serves as an indicator of evolutionary preference and can be linked to protein function. Just as language models trained on text completion can generate articles by predicting the next word iteratively, pLMs have been shown to generate novel protein sequences, sometimes conditioned on structure, species, family, Enzyme Commission (E.C.) numbers, and more.
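To make the masked-residue idea concrete, the sketch below scores a substitution by comparing the probability a model assigns to the mutant and wild-type residues at a masked position. Everything here is an assumption for illustration: the log-ratio scoring rule is one common choice rather than a method confirmed by this text, and `make_toy_model` is a hypothetical stand-in that ignores sequence context and returns smoothed column frequencies. A real pLM would condition its prediction on the rest of the sequence.

```python
import math
from typing import Callable, Dict, List

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# A "masked model" maps (sequence, masked position) -> residue probabilities.
MaskedModel = Callable[[str, int], Dict[str, float]]


def make_toy_model(family: List[str], pseudocount: float = 1.0) -> MaskedModel:
    """Hypothetical stand-in for a pLM, NOT a real language model.

    It ignores the query sequence and returns pseudocount-smoothed residue
    frequencies from the given family at the masked position.
    """
    def predict(sequence: str, position: int) -> Dict[str, float]:
        column = [seq[position] for seq in family]
        total = len(column) + pseudocount * len(AMINO_ACIDS)
        return {
            aa: (column.count(aa) + pseudocount) / total for aa in AMINO_ACIDS
        }

    return predict


def substitution_score(
    model: MaskedModel, sequence: str, position: int, mutant: str
) -> float:
    """Log-probability ratio of mutant vs. wild-type residue at a masked site."""
    probs = model(sequence, position)
    wild_type = sequence[position]
    return math.log(probs[mutant]) - math.log(probs[wild_type])


def pseudo_log_likelihood(model: MaskedModel, sequence: str) -> float:
    """Sum of log-probabilities of each residue when it is masked in turn."""
    return sum(
        math.log(model(sequence, i)[aa]) for i, aa in enumerate(sequence)
    )


# Hypothetical toy family, for illustration only.
toy_family = ["MKTAY", "MKSAY", "MRTAY", "MKTGY"]
toy_model = make_toy_model(toy_family)
k_to_r = substitution_score(toy_model, "MKTAY", 1, "R")
```

Under this scoring rule, a negative value means the model prefers the wild-type residue at that position and a positive value means it prefers the mutant.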
This thesis addresses protein design problems from the perspectives of both molecular modeling and machine learning. The first chapter evaluates existing physics-based and machine-learning-based methods for predicting thermal stability, using beta-glucosidase as an independent case study. The second chapter investigates engineering the oligomeric state of RuBisCO through evolutionary analysis and molecular modeling of the protein-protein interface. The third chapter extends the investigation of protein-protein interactions and presents a machine translation approach, based on a pLM, for designing the pairing partner when given either the heavy or the light chain of an antibody. These findings demonstrate the feasibility of protein design for thermal stability, oligomeric state, and antibody engineering, and take a small step toward potential applications in enzymatic manufacturing, carbon capture, and antibody therapeutics.