Though the title of my thesis implies a unifying theme in the application of machine learning, the two projects that form the bulk of my graduate degree are frankly more disparate than they are similar. Both endeavors contribute novel methods to a field where ground truth is obscure or limited, and both apply machine learning techniques in their methodologies. Those similarities notwithstanding, the scientific domains, technical applications, experimental designs, and overall goals remain independent. While a thesis composed of two independent parts may not be conventional, this is, as they say, not a bug but a feature. Working within (and occasionally across) two research domains has helped me acquire a diverse skill set and has given me a better and broader understanding of machine learning practices for scientific research.
As this thesis is composed of two linked but distinct projects, the abstract, like the chapters, is divided in two. The first section details work related to large-scale predictions of purchasable chemical space, and the second summarizes a novel method for automating diagnosis of melanocytic atypia in human histopathological samples.
I. Large-Scale Predictions for Purchasable Chemical Space
There are now over 400 million readily purchasable compounds listed in the ZINC database (zinc.docking.org). About 350 million (85%) of these compounds are affordable enough for the average academic lab to conduct a ligand discovery project. However, the molecular targets (proteins) that these purchasable compounds bind and modulate, if any, are rarely known. Fewer than 1 million compounds (<0.25%) have been reported active in a target-specific assay according to public databases such as ChEMBL. In the absence of target activity information, the process of selecting compounds for general-purpose screening will often be target-naïve.
To facilitate access to new chemistry for biology, my collaborator John Irwin and I generated target predictions for all purchasable compounds in ZINC at the time. I explored methods for optimizing predictive performance of compound-target associations using ChEMBL’s bioactivity dataset (version 21) as a benchmark. Comparisons on cross-validation sets of the bioactivity dataset against several methods, such as multinomial naïve Bayes classifiers, revealed that the combination of the Similarity Ensemble Approach (SEA) with the maximum Tanimoto similarity to the nearest bioactive yielded the best performance. I verified the utility of several of these predictions, quantified target-prediction biases inherent to the dataset, and suggested thresholds that let users control the sensitivity and specificity of the predictions, as well as the novelty of the target associations allowed.
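As a rough illustration of the second ingredient named above, the maximum Tanimoto similarity to the nearest bioactive, the following minimal sketch treats each fingerprint as a set of "on" bit indices. The function names, the toy fingerprints, and the threshold parameters are all hypothetical and chosen only for illustration; the SEA score is assumed to be computed elsewhere and passed in as a p-value-like number, and this is not the pipeline used in the thesis.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient between two fingerprints given as sets of on-bits."""
    a, b = set(fp_a), set(fp_b)
    union = len(a | b)
    # Two empty fingerprints share nothing measurable; define similarity as 0.
    return len(a & b) / union if union else 0.0


def max_tc_to_bioactives(query_fp, bioactive_fps):
    """Highest Tanimoto similarity between a query and a target's known bioactives."""
    return max((tanimoto(query_fp, fp) for fp in bioactive_fps), default=0.0)


def keep_prediction(sea_pvalue, max_tc, p_cutoff, tc_min, tc_max=1.0):
    """Hypothetical filter: keep a compound-target prediction only if the SEA
    score is significant and the nearest-bioactive similarity lies in a window.
    Lowering tc_max restricts results to compounds less similar to known actives."""
    return sea_pvalue <= p_cutoff and tc_min <= max_tc <= tc_max


# Toy example with made-up fingerprints for one target's known ligands.
query = {1, 2, 3}
bioactives = [{2, 3, 4}, {1, 2, 3, 4}]
nearest = max_tc_to_bioactives(query, bioactives)
```

In this sketch, tightening `p_cutoff` or raising `tc_min` trades sensitivity for specificity, while `tc_max` is one possible way to express how novel a predicted association is allowed to be.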
II. Automating Diagnosis of Melanocytic Atypia: A Precursor to Melanoma in Situ
Melanocytic atypia, a biological precursor to melanoma, is challenging to diagnose histopathologically. Pathologist interobserver agreement for melanocytic atypia in standard (H&E) histology images is low, ranging from 33% to 68%, with melanoma in situ (MIS) in particular contributing to diagnostic discordance. A lack of agreement among experts presents a challenge to any supervised learning task, where the utility of a learned function depends on the accuracy and reliability of the labels used.
To circumvent the issue of pathologist discordance in labeling, I paired H&E histology images with contiguously cut tissue sections stained immunohistochemically (IHC) for melanocytes. I developed a deep-learning pipeline for automating diagnosis of melanocytic atypia using this custom dataset of paired whole slide images (WSIs) and trained convolutional neural networks to identify the presence of melanocytes in H&E sections, using labels derived solely from the paired images. Networks achieved strong performance on holdout patient datasets. For each network trained, I generated full-scale (20X magnification), high-resolution (pixel-wise) prediction heatmaps on holdout H&E tissue sections for pathological interpretation, and applied saliency mapping to show what the networks attend to in H&E images. This pipeline aims to help clinical pathologists reach better consensus on new MIS diagnoses in cutaneous biopsies.