Large-scale biological datasets are often composed of observations that are noisy, that are biased by environment or process, and that represent fragments of a perpetually growing, yet incomplete, record of human knowledge. Changes to computational methods, to data deposition and storage, and to data collection have the potential to mitigate some of these problems. However, no single solution works for all problems, and care must be taken to ensure that a method chosen to make predictions for small molecules will be effective.
This thesis centers on prediction. How do we improve screening predictions made on biased and incomplete information? How do we better represent three-dimensional compounds to improve predictions when molecular shape is important? What methods might best inform our ability to make predictions now and improve the next step in the future? And how can we create testable hypotheses from phenotypic observations?
Chapter 1 presents the published work “Adding Stochastic Negative Examples into Machine Learning Improves Molecular Bioactivity Prediction”. To address concerns over the effect that biased molecular affinity datasets may have on the accuracy of deep learning models, this work proposes an online method to improve prediction when a dataset contains more binders than non-binders. The method, Stochastic Negative Addition (SNA), samples random, unannotated compounds and assigns them as non-binders during neural network training. SNA drastically improves the ability of the network to identify false positives in a full matrix of drugs and protein binders, while slightly hurting performance on a time split.
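The sampling step described above can be illustrated with a minimal sketch. It is not the published implementation: the function names, the one-to-one ratio of sampled negatives to annotated examples, and the choice to redraw negatives every epoch are assumptions made only for illustration.

```python
import random


def sample_stochastic_negatives(annotated, compounds, targets, n, rng):
    """Return up to n distinct unannotated (compound, target) pairs labeled 0."""
    negatives = set()
    attempts = 0
    max_attempts = 100 * n  # guard for dense matrices with few empty cells
    while len(negatives) < n and attempts < max_attempts:
        attempts += 1
        pair = (rng.choice(compounds), rng.choice(targets))
        if pair not in annotated:  # never overwrite a real annotation
            negatives.add(pair)
    return [(c, t, 0) for (c, t) in sorted(negatives)]


def training_batches(labeled, compounds, targets, n_epochs, neg_ratio=1.0, seed=0):
    """Yield one shuffled training set per epoch with freshly drawn negatives.

    `labeled` holds (compound, target, label) triples from the dataset.
    `neg_ratio` (hypothetical parameter) sets how many assumed non-binders
    are added relative to the number of annotated examples.
    """
    rng = random.Random(seed)
    annotated = {(c, t) for (c, t, _) in labeled}
    n_neg = int(neg_ratio * len(labeled))
    for _ in range(n_epochs):
        sampled = sample_stochastic_negatives(annotated, compounds, targets, n_neg, rng)
        epoch_data = list(labeled) + sampled
        rng.shuffle(epoch_data)
        yield epoch_data
```

Redrawing the assumed non-binders each epoch means that no single unannotated pair is treated as a confirmed negative for the whole of training, which is one way to read the "online" character of the method.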
Chapter 2 encompasses the published work “A Simple Representation of Three-Dimensional Molecular Structure”, which presents the Extended Three-Dimensional Fingerprint (E3FP). This molecular fingerprinting technique generates a fingerprint that can represent three-dimensional structure for statistical and machine learning methods. Its advantages over two-dimensional fingerprints include the ability to encode structural relationships within a molecule and the aggregation of fingerprints into molecular ensembles. E3FP was compared against existing two- and three-dimensional representations, and Chapter 2 shows some cases where the method outperformed these existing techniques.
Chapter 3 provides a brief commentary on the current outlook of deep learning for prediction of Absorption, Distribution, Metabolism, Excretion, and Toxicity (ADMET). It describes how changes to molecular representations for deep learning have improved prediction of ADMET endpoints. It speculates on why techniques like neural network fine-tuning may be falling short. It also argues for deep learning interpretability and error estimation, both to improve trust in deep learning models and to facilitate their iterative improvement.
Finally, unpublished work in Chapter 4 focuses on how to use high-throughput screening data to predict and rank a set of proteins that describe the clearance of free tau, with applications to Alzheimer’s disease.