Path Sampling and Machine Learning Approaches to Biomolecular Simulation
- Ray, Dhiman
- Advisor(s): Andricioaei, Ioan
This thesis describes the development and application of advanced computational methods for studying biomolecular processes using atomistic molecular dynamics (MD) simulations. Because of high free energy barriers, many interesting biological phenomena, such as protein-ligand binding and protein conformational change, take place on timescales well beyond the reach of present-day supercomputers; these processes are therefore referred to as rare events. Two alternative ways of addressing this problem are outlined here. The first approach is the development of a novel path sampling strategy called Weighted Ensemble Milestoning (WEM) and its improved form, Markovian Weighted Ensemble Milestoning (M-WEM). These combined sampling methods are tested on model rare-event problems and applied to the unbinding of a ligand from a protein, a process with potential implications for computer-aided drug design. Both methods yield the free energy profile and the kinetics of long-timescale processes in agreement with experiment or standard MD simulation, at orders of magnitude lower computational cost. The second approach is the use of unsupervised machine learning techniques, such as time-lagged independent component analysis (tICA) and linear mutual information (LMI), to identify the slowest degrees of freedom and the allosteric communication pathways in biomolecules. In the first example, a Markov state model (MSM) is constructed on tICA coordinates for the Watson-Crick to Hoogsteen base-pairing transition in nucleic acids, a process involved in DNA repair and replication. From the tICA-MSM model, the underlying free energy landscape and the kinetics of the conformational switching process are predicted in agreement with experimental results.
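As an illustration of the tICA step mentioned above, the following is a minimal NumPy sketch, not the implementation used in the thesis. It assumes a trajectory already featurized into an array of shape (frames, features) and solves the generalized eigenvalue problem C(τ)v = λC(0)v with symmetrized covariance estimates. The function names `tica` and `make_toy_trajectory`, the toy data, and the lag value are hypothetical choices made only for this example.

```python
import numpy as np


def tica(X, lag, n_components=2):
    """Minimal tICA sketch (illustrative, not the thesis code).

    X   : array of shape (n_frames, n_features), e.g. torsion-angle features
    lag : lag time in frames, must be >= 1
    Returns (eigenvalues, components, projections), slowest component first.
    """
    X = np.asarray(X, dtype=float)
    X = X - X.mean(axis=0)                  # remove the mean of each feature
    X0, Xt = X[:-lag], X[lag:]              # time-shifted copies of the data
    n = X0.shape[0]
    # Symmetrized instantaneous and time-lagged covariance matrices
    C0 = (X0.T @ X0 + Xt.T @ Xt) / (2.0 * n)
    Ct = (X0.T @ Xt + Xt.T @ X0) / (2.0 * n)
    # Whiten with C0^(-1/2), dropping near-singular directions
    s, U = np.linalg.eigh(C0)
    keep = s > 1e-10 * s.max()
    W = U[:, keep] / np.sqrt(s[keep])       # W.T @ C0 @ W is the identity
    # Ordinary symmetric eigenproblem in the whitened space
    lam, V = np.linalg.eigh(W.T @ Ct @ W)
    order = np.argsort(lam)[::-1][:n_components]
    components = W @ V[:, order]
    return lam[order], components, X @ components


def make_toy_trajectory(n_frames=5000, seed=0):
    """Hypothetical test data: one slow AR(1) mode mixed with white noise."""
    rng = np.random.default_rng(seed)
    slow = np.zeros(n_frames)
    for t in range(1, n_frames):
        slow[t] = 0.99 * slow[t - 1] + rng.normal(scale=0.1)
    fast = rng.normal(size=n_frames)
    mixing = np.array([[1.0, 1.0], [1.0, -1.0]])
    X = np.column_stack([slow, fast]) @ mixing
    return X, slow


X, slow = make_toy_trajectory()
eigvals, components, proj = tica(X, lag=10)
print("tICA eigenvalues:", eigvals)
print("correlation of tIC 1 with the slow mode:",
      abs(np.corrcoef(proj[:, 0], slow)[0, 1]))
```

In this toy setting the leading component should track the slowly decorrelating mode. In practice, the projections onto the leading components would typically be clustered into discrete states before estimating an MSM.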
Furthermore, for the coronavirus spike protein, analyzing the coupling of the protein backbone torsion angles with the tICA coordinate identifies the amino acid residues that most strongly affect the conformational change and allows future mutations to be predicted. Using this technique, multiple new mutations are predicted for the wild-type SARS-CoV-2 spike protein, two of which have been observed in the highly contagious new variants of the virus. In addition, the mechanism of antigen-antibody recognition and of immune evasion due to mutations in the spike protein can be demonstrated by analyzing the allosteric communication pathways using LMI-based inter-residue cross-correlation. Thus, the path sampling and machine learning methods described in this dissertation can facilitate the study of complex biomolecular systems in quantitative detail and may find application in the rational design of therapeutic agents.
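The LMI-based cross-correlation mentioned above can be sketched as follows. This is a minimal illustration under stated assumptions rather than the analysis code of the dissertation: it assumes one representative atom per residue, coordinates already aligned to remove overall rotation and translation, and the Gaussian (linear) estimate I = ½[ln det C_i + ln det C_j − ln det C_ij], converted to a generalized correlation coefficient r = sqrt(1 − exp(−2I/3)). The function name and the synthetic data are hypothetical.

```python
import numpy as np


def linear_mutual_information(coords):
    """Pairwise linear mutual information (illustrative sketch).

    coords : array of shape (n_frames, n_atoms, 3), assumed pre-aligned
    Returns (lmi, gcc): LMI in nats and the generalized correlation
    coefficient, both of shape (n_atoms, n_atoms).
    """
    coords = np.asarray(coords, dtype=float)
    n_frames, n_atoms, dim = coords.shape
    fluct = coords - coords.mean(axis=0)             # positional fluctuations
    flat = fluct.reshape(n_frames, n_atoms * dim)
    cov = flat.T @ flat / (n_frames - 1)             # full covariance matrix

    def logdet(block):
        # slogdet avoids overflow/underflow in the determinant itself
        _, value = np.linalg.slogdet(block)
        return value

    lmi = np.zeros((n_atoms, n_atoms))
    for i in range(n_atoms):
        idx_i = np.arange(dim * i, dim * (i + 1))
        ld_i = logdet(cov[np.ix_(idx_i, idx_i)])
        for j in range(i + 1, n_atoms):
            idx_j = np.arange(dim * j, dim * (j + 1))
            idx_ij = np.concatenate([idx_i, idx_j])
            ld_j = logdet(cov[np.ix_(idx_j, idx_j)])
            ld_ij = logdet(cov[np.ix_(idx_ij, idx_ij)])
            value = 0.5 * (ld_i + ld_j - ld_ij)
            lmi[i, j] = lmi[j, i] = max(value, 0.0)  # clip round-off negatives

    gcc = np.sqrt(1.0 - np.exp(-2.0 * lmi / dim))
    np.fill_diagonal(gcc, 1.0)   # self-correlation set to 1 by convention
    return lmi, gcc


# Synthetic example: atom 1 follows atom 0 closely, atom 2 is independent
rng = np.random.default_rng(1)
a0 = rng.normal(size=(5000, 3))
a1 = a0 + 0.1 * rng.normal(size=(5000, 3))
a2 = rng.normal(size=(5000, 3))
lmi, gcc = linear_mutual_information(np.stack([a0, a1, a2], axis=1))
print(np.round(gcc, 3))
```

A correlation matrix of this kind is commonly turned into a residue network, with edge weights derived from the correlation values, on which communication pathways are then traced. That step is omitted here.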