Entropy is a fundamental concept in science. It describes the disorder, randomness, and uncertainty of a physical, biological, or social system. While understanding entropy has far-reaching impact on advancing our knowledge in many scientific areas and in our society, the development of rigorous theories and computational technologies for entropy is a rather challenging task due to the vast complexity of the underlying systems. In the context of biological molecules such as proteins and DNA, the entropy defined in statistical mechanics and thermodynamics is a critical part of the total free energy of such molecules in a chemical environment. Efficient and accurate calculation of this entropy is of particular interest, as the calculation of free energy, which is fundamental to physical and biological processes, is known to be notoriously difficult. The need for, and recent interest in, advanced computational methods for the entropy of biological molecular systems have directly motivated this dissertation work.
The basic mathematical and statistical definition of entropy, the Shannon entropy of information theory, for a random variable in a Euclidean space is the negative expectation of the natural logarithm of the probability density function (PDF) of the random variable. The entropy of a physical or biological system can be written in the form of, or approximated by, the Shannon entropy with a suitably defined PDF that carries physical meaning. In practice, the dimension of the underlying random variable can be very high and, in addition, its PDF may not be known. The goal of my study is to develop efficient and accurate computational methods for the Shannon entropy, with application to the calculation of the entropy of a particle system that may consist of many particles forming a liquid.
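In symbols, for a random variable $X$ taking values in $\mathbb{R}^d$ with PDF $p$, this definition reads
\[
H(X) \;=\; -\,\mathbb{E}\bigl[\ln p(X)\bigr] \;=\; -\int_{\mathbb{R}^d} p(x)\,\ln p(x)\,\mathrm{d}x .
\]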
In this dissertation work, I begin with a formal derivation of a class of nonparametric kNN-type estimators of the entropy, including the classical kNN estimator, the kpN estimator recently introduced by physicists, and a new estimator, the kp-kernel estimator, that I have constructed. One of my objectives is to understand whether these estimators can better capture properties related to singular behaviors of an underlying PDF, such as the ``tail'' of the PDF. My extensive numerical simulations with these estimators, applied to several different PDFs, demonstrate some of these advantages, including a better description of strongly correlated systems and more accurate sampling of the tail of a given distribution. I then present a convergence analysis showing that some of these estimators converge in expectation under realistic assumptions.
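For concreteness, a minimal sketch of the classical kNN estimator is given below, in the Kozachenko--Leonenko form commonly written in the literature; the function name, the choice of $k$, and the use of SciPy here are illustrative assumptions, and the kpN and kp-kernel estimators modify this construction in ways not shown.
\begin{verbatim}
# Classical kNN entropy estimator (Kozachenko-Leonenko type), in nats.
# Samples are the rows of an (N, d) NumPy array.
import numpy as np
from scipy.spatial import cKDTree
from scipy.special import digamma, gammaln

def knn_entropy(samples, k=3):
    n, d = samples.shape
    tree = cKDTree(samples)
    # Distance from each point to its k-th nearest neighbor
    # (query k+1 neighbors because each point is its own nearest neighbor).
    dist, _ = tree.query(samples, k=k + 1)
    eps = dist[:, -1]
    # Log volume of the d-dimensional unit ball: pi^(d/2) / Gamma(d/2 + 1).
    log_vd = 0.5 * d * np.log(np.pi) - gammaln(0.5 * d + 1.0)
    return digamma(n) - digamma(k) + log_vd + d * np.mean(np.log(eps))
\end{verbatim}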
Subsequently, I apply these kNN-type entropy estimators to calculate the entropy of simple molecular systems. Here a statistical-mechanics theory of simple liquids is invoked, and the entropy is expressed as a series of terms, each of which is a Shannon entropy; the first two terms are known to be the most important. I implement the Markov chain Monte Carlo method to sample the underlying molecular system, and then use the kNN and kpN methods to estimate the entropy.
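For reference, in the standard multiparticle-correlation expansion for a homogeneous simple liquid, the pair (two-body) contribution to the excess entropy per particle is commonly written as
\[
\frac{s_2}{k_B} \;=\; -\,\frac{\rho}{2}\int \bigl[\, g(r)\ln g(r) \;-\; g(r) \;+\; 1 \,\bigr]\,\mathrm{d}\mathbf{r},
\]
where $\rho$ is the number density and $g(r)$ is the radial distribution function; the exact normalization and the higher-order terms depend on the conventions of the theory being invoked.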
Finally, I present my related work on molecular dynamics (MD) simulations of the solvation of an ion in water. Using the radial distribution function of the water molecules surrounding the ion, obtained from the MD simulations, I determine the effective radius of the ion. I also compare the results of the MD simulations with those of a stochastic ordinary differential equation (SODE) model to examine the validity of such an SODE approach. The work presented here is a first step toward combining statistical methods and computational analysis to tackle one of the most complex problems in the mathematical modeling and computer simulation of biological molecules.
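To illustrate one way such a radius may be read off, the sketch below bins ion--water distances collected over MD frames into $g(r)$ and takes the location of the first peak as the effective radius; this criterion and all names here are illustrative assumptions, not necessarily the exact procedure used in the dissertation.
\begin{verbatim}
# Estimate g(r) for a single ion per frame from pooled ion-water distances,
# then read off an "effective radius" (illustrative criterion: first peak).
import numpy as np

def radial_distribution(distances, box_volume, n_frames, n_water,
                        r_max, n_bins=200):
    edges = np.linspace(0.0, r_max, n_bins + 1)
    counts, _ = np.histogram(distances, bins=edges)
    r = 0.5 * (edges[:-1] + edges[1:])
    shell_vol = (4.0 / 3.0) * np.pi * (edges[1:]**3 - edges[:-1]**3)
    density = n_water / box_volume           # bulk number density of water
    ideal = density * shell_vol * n_frames   # ideal-gas expectation per bin
    return r, counts / ideal

# Example (hypothetical arrays and parameters):
# r, g = radial_distribution(d_ion_water, V, n_frames, n_water, r_max=10.0)
# r_eff = r[np.argmax(g)]
\end{verbatim}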
My detailed studies of a class of nonparametric entropy estimators and their application to molecular modeling demonstrate that these methods are promising. More work remains to improve the efficiency of some of these estimators and to develop a complete theory of convergence. Further theory and related methods are also needed to apply these estimators more effectively in molecular modeling.