In drug discovery, when a model of the protein structure is available, molecular docking is a well-established approach for predictive modeling. Docking algorithms use a search strategy to explore ligand poses within an active site and a scoring function to evaluate those poses. This dissertation explores improvements to both aspects of docking, emphasizing the use of machine learning methods for improving scoring functions. The work builds on Surflex, an extensible software platform for modeling molecular interactions.
Performance was evaluated on publicly available benchmarks, some of which were constructed in the course of this work. The novel tool pdbgrind, developed as part of the infrastructure for this work, was used to generate the large amount of data necessary to create adequate training and test sets. While the dissertation focuses most strongly on the scoring function problem in docking, it also addresses the tightly coupled problem of search, and modest improvements were shown by enhancing Surflex's representation of protein active sites.
The bulk of the work describes improvements to empirical scoring functions for protein-ligand interactions. This dissertation demonstrates a robust method for tuning scoring function parameters to improve modeling of known binding phenomena. Penalties for inter-atomic overlap and same-charge repulsion were learned using synthetic negative data. The new function was shown to be equivalent to or better than the original function in terms of screening utility on a large and diverse benchmark. This approach was generalized to the entire scoring function to support the use of multiple constraints in refining scoring function parameters. Using the constraint-based optimization procedure, users can exploit multiple types of data to customize functions to suit a particular task, a particular protein target, or a family of targets. Significant improvement in screening utility was shown using data typical of applications in docking.
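As an illustration only, the idea of learning penalty terms from synthetic negative data can be sketched as follows. The sketch assumes a hypothetical linear scoring function over three made-up pose features (favorable contact, inter-atomic overlap, same-charge proximity) and a simple margin constraint requiring each known pose to outscore its synthetic negative. None of the names, features, or numbers below come from Surflex or from the procedure used in this work.

```python
from typing import List, Sequence, Tuple

# Hypothetical pose features: (favorable_contact, overlap, same_charge).
Pose = Sequence[float]


def score(weights: Sequence[float], pose: Pose) -> float:
    """Toy linear scoring function; higher means a better pose."""
    return sum(w * f for w, f in zip(weights, pose))


def fit_penalties(
    pairs: List[Tuple[Pose, Pose]],
    weights: Sequence[float],
    learnable: Sequence[int] = (1, 2),
    margin: float = 1.0,
    lr: float = 0.1,
    epochs: int = 200,
) -> List[float]:
    """Adjust only the penalty weights so each known pose outscores its
    synthetic negative by at least `margin` (subgradient steps on a hinge loss)."""
    w = list(weights)
    for _ in range(epochs):
        for native, decoy in pairs:
            if score(w, native) - score(w, decoy) < margin:
                # Constraint violated: move each learnable weight along the
                # feature difference between the known pose and the negative.
                for i in learnable:
                    w[i] += lr * (native[i] - decoy[i])
    return w


# Made-up data: each synthetic negative gains contact by clashing with the
# protein or by placing like charges close together.
pairs = [
    ((5.0, 0.1, 0.0), (6.0, 2.0, 0.0)),
    ((4.0, 0.0, 0.0), (4.5, 0.2, 1.5)),
]
initial = [1.0, 0.0, 0.0]  # contact term fixed; penalty terms start at zero
fitted = fit_penalties(pairs, initial)
print(fitted)
```

In this toy setting the contact weight stays fixed while the two penalty weights become negative, so the synthetic negatives no longer outscore the known poses. Additional kinds of data would enter the same loop as further constraints.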
The main contributions of this dissertation are generalizable methods for generating and exploiting multiple types of data in refining scoring functions for docking. The approaches can be extended to other areas, such as quantitative structure-activity prediction or protein folding.