The solvation free energy of organic molecules is a critical parameter in
determining emergent properties such as solubility, liquid-phase equilibrium
constants, and pKa and redox potentials in an organic redox flow battery. In
this work, we present a machine learning (ML) model that can learn and predict
the aqueous solvation free energy of an organic molecule using Gaussian process
regression with a new molecular graph kernel. To investigate the
performance of the ML model on the electrostatic interaction, the nonpolar
solvent contribution, and the solute's conformational entropy in the
solvation free energy, we test three data sets that differ in the water solvent
model (implicit or explicit) and in the treatment of the solute's
conformational entropy. We
demonstrate that our ML model can predict the solvation free energy of
molecules to chemical accuracy, with a mean absolute error of less than 1
kcal/mol, for subsets of the QM9 data set and the FreeSolv database. To address the
general data scarcity problem for a graph-based ML model, we propose a
dimension reduction algorithm based on the distance between molecular graphs,
which can be used to examine the diversity of the molecular data set. It
provides a promising way to build a minimal training set that improves prediction
for certain test sets where the space of molecular structures is predetermined.
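
As a rough illustration of the two ingredients named above, the following minimal sketch shows Gaussian process regression from a precomputed kernel matrix, followed by a low-dimensional embedding of a kernel-induced distance. Everything specific here is an assumption made for illustration and is not taken from this work: the toy RBF kernel on synthetic descriptor vectors stands in for the molecular graph kernel, the distance d(i, j) = sqrt(K_ii + K_jj - 2 K_ij) is the standard kernel-induced metric rather than necessarily the graph distance used here, and classical multidimensional scaling stands in for the proposed dimension reduction algorithm. The function names are hypothetical.

```python
import numpy as np

def gpr_predict(K_train, y_train, K_cross, noise=1e-6):
    """GP posterior mean from precomputed kernels.
    K_train: (n, n) train/train kernel; K_cross: (m, n) test/train kernel."""
    n = K_train.shape[0]
    # Cholesky factor of the noise-regularized kernel matrix.
    L = np.linalg.cholesky(K_train + noise * np.eye(n))
    # alpha = (K + noise * I)^{-1} y via two triangular systems.
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_train))
    return K_cross @ alpha

def kernel_distance(K):
    """Kernel-induced distance: d_ij = sqrt(K_ii + K_jj - 2 K_ij)."""
    diag = np.diag(K)
    d2 = diag[:, None] + diag[None, :] - 2.0 * K
    return np.sqrt(np.clip(d2, 0.0, None))  # clip guards against round-off

def classical_mds(D, dim=2):
    """Classical MDS: embed a distance matrix into `dim` coordinates."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n      # centering matrix
    B = -0.5 * J @ (D ** 2) @ J              # double-centered Gram matrix
    w, V = np.linalg.eigh(B)
    idx = np.argsort(w)[::-1][:dim]          # largest eigenvalues first
    return V[:, idx] * np.sqrt(np.clip(w[idx], 0.0, None))

def toy_kernel(A, B, length=1.0):
    """Stand-in RBF kernel on descriptor vectors (NOT a graph kernel)."""
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * sq / length ** 2)

# Synthetic data: 30 "molecules" with 3 made-up descriptors each.
rng = np.random.default_rng(0)
X = rng.normal(size=(30, 3))
y = X.sum(axis=1)                            # synthetic target, not real energies
train, test = slice(0, 20), slice(20, 30)

K = toy_kernel(X, X)
y_pred = gpr_predict(K[train, train], y[train], K[test, train], noise=1e-3)
mae = np.mean(np.abs(y_pred - y[test]))      # mean absolute error on the test split

D = kernel_distance(K)                       # pairwise distances between all samples
coords = classical_mds(D, dim=2)             # 2D map for inspecting data set diversity
```

In such a map, test points that lie far from every training point would be natural candidates to add to the training set; how candidates are actually selected in this work is not specified in the text above.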