Proteins serve many crucial functions in maintaining life, but have also been co-opted for human endeavors such as gene editing, immunotherapy, and plastic degradation. To better serve such needs, we re-engineer naturally occurring proteins or design proteins de novo. In recent years, data at an unprecedented scale from evolution, 3D structures, and experiments have become available for learning to design and engineer proteins. These distinct data types provide different yet complementary information about proteins. This thesis presents new machine learning methods that learn from multiple types of data for two problems: sequence-based protein fitness prediction and structure-based fixed-backbone protein design.
For sequence-based protein fitness prediction, machine learning models typically learn from either unlabelled, evolutionarily related sequences or variant sequences with experimentally measured labels. In regimes where only limited experimental data are available, combining both sources of information could improve protein fitness prediction. Toward that goal, we propose a simple combination approach that is competitive with, and on average outperforms, more sophisticated methods. The comparative analysis also highlights the importance of systematic evaluations and sufficient baselines.
For structure-based fixed-backbone protein design, prior machine learning approaches have been limited by the number of available experimentally determined protein structures. We present a strategy to augment the training data by nearly three orders of magnitude by predicting millions of structures using AlphaFold2. Graph neural network and Transformer models trained with this additional data achieve an overall improvement of almost 10 percentage points over existing methods in native sequence recovery rate. We also study generalization to a variety of more complex tasks, including the design of protein complexes, partially masked structures, binding interfaces, and multiple states.
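
For readers unfamiliar with the metric, native sequence recovery rate is commonly taken to be the fraction of positions at which the designed sequence has the same amino acid as the native sequence of the target backbone. The sketch below illustrates that common definition only; the function name, the optional mask argument, and the example sequences are hypothetical and are not taken from the thesis, whose exact evaluation protocol may differ.

```python
from typing import Optional, Sequence


def native_sequence_recovery(
    native: str,
    designed: str,
    mask: Optional[Sequence[bool]] = None,
) -> float:
    """Fraction of scored positions where the designed residue equals the native one.

    `mask` (hypothetical argument) marks which positions are scored, for example
    only the residues that were actually designed. By default every position counts.
    """
    if len(native) != len(designed):
        raise ValueError("native and designed sequences must have the same length")
    if mask is None:
        mask = [True] * len(native)
    if len(mask) != len(native):
        raise ValueError("mask must have one entry per residue")

    # Indices of the positions that contribute to the score.
    scored = [i for i, keep in enumerate(mask) if keep]
    if not scored:
        raise ValueError("at least one position must be scored")

    matches = sum(native[i] == designed[i] for i in scored)
    return matches / len(scored)


# Made-up eight-residue sequences that differ at two positions.
recovery = native_sequence_recovery("MKTAYIAK", "MKSAYLAK")
```

Under this definition, an improvement of 10 percentage points means that, on average, ten more residues out of every hundred scored positions match the native sequence.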