Geometric Learning for Quantum-Informed, Machine Learning and Analysis of Electrostatic Preorganization
- Vargas, Santiago
- Advisor(s): Alexandrova, Anastassia N
Abstract
This thesis is organized in a slightly unconventional fashion: algorithms lead and applications fill out the content. I think this emphasizes my interests during graduate school: I built algorithms and tools to address problems that were otherwise inaccessible in different areas of computational chemistry (including applied machine learning) and enzymology. Two scientific thrusts underlie the bulk of my work: algorithms to analyze dynamic, heterogeneous electric fields in the context of enzymology, and flexible machine learning algorithms, including those that leverage quantum descriptors, for rigorous prediction of molecular and reaction-level properties. Each section also grounds the work in applications and broader impacts for the reader. Below, I discuss the main thrusts and briefly outline each chapter.
General ML and Quantum Theory of Atoms-in-Molecules (QTAIM): QTAIM serves as a mathematical decomposition algorithm for electronic basins within a molecule. The algorithm takes molecular electron densities, typically computed by density functional theory (DFT), and uses the flux of the density gradient to partition the scalar field into three-dimensional atomic basins [14, 16]. Each basin represents a quantum atom within the molecule. From these basins, we compute a rich set of mathematical descriptors that map to many features, including energies, bonding, and electron delocalization. These features have previously been correlated with activation energies, reactivity, and overall system energies, but those uses largely relied on human intervention and small datasets [44, 62, 65, 111, 142, 287]. By developing software centered on high-throughput QTAIM calculations and machine learning, I was able to bring these descriptors to larger datasets and a wide range of applications. In Chapter 2, I discuss an algorithm I implemented to predict Diels-Alder reaction barriers from QTAIM signatures alone. In this study, we showed that QTAIM features can be used to predict reaction barriers, and we used machine learning techniques to understand which signatures were most informative to our models. QTAIM electrostatic potentials and delocalization indices alone yielded strong performance on withheld datasets. In addition, we demonstrated that QTAIM features allow a machine learning model to generalize, to an extent, to much larger Diels-Alder reactions. This chapter was adapted from the following: Machine Learning to Predict Diels–Alder Reaction Barriers from the Reactant State Electron Density. S. Vargas*, M. Hannefarth, Z. Liu, A.N. Alexandrova. Journal of Chemical Theory and Computation 2021 17 (10), 6203-6213. DOI: 10.1021/acs.jctc.1c00623.
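For reference, the basin partitioning summarized above is conventionally written as a zero-flux condition on the gradient of the electron density. The symbols here are standard QTAIM notation rather than notation taken from this abstract: \rho is the electron density, \Omega is an atomic basin, S_\Omega is the surface bounding it, \mathbf{n} is the unit normal to that surface, and N(\Omega) is the electron population of the basin, shown as one example of a basin-integrated descriptor.

```latex
\nabla \rho(\mathbf{r}) \cdot \mathbf{n}(\mathbf{r}) = 0
\quad \text{for all } \mathbf{r} \in S_{\Omega},
\qquad
N(\Omega) = \int_{\Omega} \rho(\mathbf{r}) \, d\mathbf{r}
```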
In Chapter 3, I discuss a package developed to perform high-throughput QTAIM calculations on datasets of molecules and reactions. This package is currently adapted to work with open-source packages such as ORCA and Multiwfn. These programs, respectively, compute DFT densities at a user-specified level of theory and then compute QTAIM descriptors from those densities. The package is built with high-performance computing (HPC) in mind, as it can operate on a single dataset with an arbitrary number of concurrent jobs. I also used the package to compute QTAIM values for a diverse set of important and difficult datasets and developed graph neural networks that take QTAIM descriptors as inputs to predict molecular and reaction properties. This chapter was adapted from the following: High-throughput quantum theory of atoms in molecules (QTAIM) for geometric deep learning of molecular and reaction properties. Santiago Vargas, Winston Gee, and Anastassia N. Alexandrova. Digital Discovery 2024 3, 987-998.
Advancing Analysis of Electric Fields in Proteins: The later chapters follow our work in developing algorithms to ingest, interpret, and predict on electric fields in protein active sites. This work builds on the notion of electrostatic preorganization, a theory that posits that protein scaffolds arrange to electrostatically catalyze chemical reactions, thereby destabilizing reactants while lowering transition state energies [299, 301]. Chapter 4 describes exhaustive efforts to apply heterogeneous electric field analysis to understand a protoglobin directed evolution (DE) trajectory. Previous DE efforts optimized protoglobin to efficiently catalyze carbene transfer reactions. We show that traditional explanations for increased catalytic activity across the DE lineage, namely substrate access and binding, cannot account for the dramatic improvements in protein activity. By tracking the 3-D electric field and using clustering algorithms, we pinpoint representative structures for QM/MM calculations and show that changes in the electric field along the DE trajectory improve carbene transfer reactivity. These findings highlight the role that electrostatic organization, notably its dynamic effect, plays in determining protein function and point to its future importance in designing proteins for relevant chemical processes. This chapter is adapted from Directed Evolution of Protoglobin Optimizes the Enzyme Electric Field. Shobhit S. Chaturvedi, Santiago Vargas, Pujan Ajmera, and Anastassia N. Alexandrova. Journal of the American Chemical Society 2024 146 (24), 16670-16680. DOI: 10.1021/jacs.4c03914.
In Chapter 5, I introduce a machine learning framework designed to predict enzyme functionality directly from the heterogeneous electric fields applied to protein active sites. We apply this method to a dataset of heme-iron oxidoreductases. Previous approaches, which focus on simple point electric fields along the Fe-O bond, are insufficient for reasonable accuracy. In contrast, our 3-D, heterogeneous model can accurately predict protein activity without relying on additional protein-specific information. In addition, feature selection elucidates which electric field components most inform our models and thus highlights the components most important to reactivity and selectivity. Finally, we apply the previously mentioned electric field clustering algorithms and QM/MM calculations to reveal how dynamic complexities in protein structures can complicate predictions, and thus provide a path forward for improved models in this space. This chapter is adapted from Machine-learning prediction of protein function from the portrait of its intramolecular electric field. S. Vargas*, S. Chaturvedi, A.N. Alexandrova. (Accepted, Journal of the American Chemical Society)
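To make the idea of clustering electric fields to pick representative structures more concrete, here is a minimal sketch. The abstract does not specify the clustering algorithm or the data layout used in the thesis, so everything below is an assumption for illustration: plain k-means stands in for the actual method, the function name cluster_field_snapshots is hypothetical, and each frame is assumed to be a 3-D vector field sampled on the same grid of points.

```python
import numpy as np


def cluster_field_snapshots(fields, n_clusters, n_iter=50, seed=0):
    """Cluster electric-field snapshots with plain k-means (illustrative only).

    fields: array of shape (n_frames, n_points, 3); one vector field per
            frame, sampled on a shared grid (hypothetical layout).
    Returns (labels, representatives). representatives[k] is the index of the
    member frame closest to the centroid of cluster k, or -1 if it is empty.
    """
    fields = np.asarray(fields, dtype=float)
    n_frames = fields.shape[0]
    # Flatten each field into a single feature vector per frame.
    X = fields.reshape(n_frames, -1)

    rng = np.random.default_rng(seed)
    # Initialise centroids from distinct, randomly chosen frames.
    centroids = X[rng.choice(n_frames, size=n_clusters, replace=False)].copy()

    for _ in range(n_iter):
        # Squared Euclidean distance from every frame to every centroid.
        dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        new_centroids = centroids.copy()
        for k in range(n_clusters):
            members = X[labels == k]
            if len(members) > 0:  # keep the old centroid if a cluster empties
                new_centroids[k] = members.mean(axis=0)
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids

    # Recompute assignments against the final centroids.
    dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    labels = dists.argmin(axis=1)

    # For each cluster, pick the member frame nearest its centroid.
    representatives = np.full(n_clusters, -1)
    for k in range(n_clusters):
        idx = np.flatnonzero(labels == k)
        if idx.size:
            representatives[k] = idx[dists[idx, k].argmin()]
    return labels, representatives
```

In a workflow like the one described, the representative frame indices would then select the structures passed on to more expensive calculations such as QM/MM. The distance metric, the number of clusters, and the field sampling grid are all choices this sketch does not attempt to settle.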