Predicted Biological Activity of Purchasable Chemical Space.
- Author(s): Irwin, John J
- Gaskins, Garrett
- Sterling, Teague
- Mysinger, Michael M
- Keiser, Michael J
- et al.
Published Web Locationhttps://doi.org/10.1021/acs.jcim.7b00316
Whereas 400 million distinct compounds are now purchasable within the span of a few weeks, the biological activities of most are unknown. To facilitate access to new chemistry for biology, we have combined the Similarity Ensemble Approach (SEA) with the maximum Tanimoto similarity to the nearest bioactive to predict activity for every commercially available molecule in ZINC. This method, which we label SEA+TC, outperforms both SEA and a naïve-Bayesian classifier via predictive performance on a 5-fold cross-validation of ChEMBL's bioactivity data set (version 21). Using this method, predictions for over 40% of compounds (>160 million) have either high significance (pSEA ≥ 40), high similarity (ECFP4MaxTc ≥ 0.4), or both, for one or more of 1382 targets well described by ligands in the literature. Using a further 1347 less-well-described targets, we predict activities for an additional 11 million compounds. To gauge whether these predictions are sensible, we investigate 75 predictions for 50 drugs lacking a binding affinity annotation in ChEMBL. The 535 million predictions for over 171 million compounds at 2629 targets are linked to purchasing information and evidence to support each prediction and are freely available via https://zinc15.docking.org and https://files.docking.org .