Surrogate and Iterative Machine Learning Methods for Modelling Chemical Phenomena
eScholarship
Open Access Publications from the University of California

UC Berkeley

UC Berkeley Electronic Theses and Dissertations


Abstract

Modelling soil properties has important implications both for soil remediation and for preventing the misapplication of fertilizer in large-scale farming settings. By better understanding the dynamics of radioactive cesium infiltration and binding in clays, remediation strategies can be designed to lessen the long-term impact of radioactive particles on the environment and society. Similarly, adverse environmental effects associated with fertilizer runoff, such as toxic algal blooms, can be mitigated by precisely modelling soil nutrient concentrations and quantitatively predicting the economic effect of fertilizer application. Roadblocks to modelling microscale ion diffusion in bulk clay include the relatively long timescale of cesium-potassium ion exchange as well as the excessive computational cost associated with modelling all-atom systems; in particular, explicitly modelling hydrogen atoms drastically reduces the minimum simulation timestep. Modelling crop yield as a function of the spatial distribution of soil nutrients is complicated by an inability to take a dense set of soil samples in large-scale farms. There is also a relative lack of traditional agronomic literature quantitatively describing crop yield as a function of the high-dimensional soil nutrient feature space.

Machine learning and surrogate modelling methods are becoming increasingly common in engineering and science. While “black box” methods such as random forest regression and neural network modelling have been very successful at fitting physical phenomena, there is an increasing need to qualitatively and quantitatively improve model interpretability and computational efficiency. In addition, machine learning models can be quite computationally expensive to use in optimization and may not have a well-defined methodology for doing so.
Methods for improving computational efficiency of a model include coarse-graining (in the case of all-atom simulations) or approximating a “black box” model with another model designed to have tractable optimization properties. In order to retain fidelity to the initial model while increasing interpretability, in both cases the dimensionality of the model is reduced, either by introducing multi-atom coarse-grain centers or by approximating the target function as a linear combination of low-dimensional components. To ensure that the coarse-grain or reduced order surrogate models accurately capture properties of the original model, information from the model being approximated is used in their construction. In the case of reduced order surrogate modelling of random forest regression, low-dimensional components are chosen on the basis of ranked feature interaction importance. When iterative Boltzmann inversion (IBI) is used to coarse-grain an all-atom simulation, the radial distribution functions of only a subset of atoms are used to reproduce structural and thermodynamic properties of the original system.

The goal of the study performed in Chapter 2 was twofold: to use a data-driven methodology to model soybean (Glycine max L. Merr.) yield as a function of soil nutrients in well-irrigated soil, and to develop a reduced order surrogate model capable of gradient ascent optimization. Several datasets were used to approximate soil nutrient concentrations using a random forest model: discrete soil samples, dense multispectral images of the plants near midseason from an unmanned aerial vehicle (UAV), and a dense map of soil electrical conductivity. An iterative random forest (iRF) model was then fitted to a dense set of soil features, and important feature interactions of dimension 2 to 4 were extracted. Each feature interaction was used to generate a Highly Adaptive Lasso (HAL) pseudo-response surface corresponding to a low-dimensional projection of the feature space.
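As an illustration of why a linear combination of low-dimensional components is convenient for gradient ascent, the following is a minimal sketch and not the thesis implementation. The components here are hypothetical concave quadratics standing in for fitted surfaces (they are not HAL fits), and the names `make_component`, `surrogate`, `surrogate_gradient`, and `gradient_ascent`, the feature index sets, the weights, and the step size are all assumptions chosen so that the maximizer is known in advance.

```python
import numpy as np

def make_component(idx, weight, centre):
    """Hypothetical low-dimensional component acting only on features `idx`.

    A concave quadratic is used purely as a stand-in for a fitted surface.
    """
    idx = np.asarray(idx)
    centre = np.asarray(centre, dtype=float)
    f = lambda z: -np.sum((z - centre) ** 2)
    grad = lambda z: -2.0 * (z - centre)
    return idx, weight, f, grad

# Two illustrative components of dimension 2 and 3 over a 4-feature space.
components = [
    make_component([0, 1], 1.0, [1.0, 2.0]),
    make_component([1, 2, 3], 0.5, [2.0, 0.0, -1.0]),
]

def surrogate(x):
    """Weighted sum of the low-dimensional components."""
    x = np.asarray(x, dtype=float)
    return sum(w * f(x[idx]) for idx, w, f, _ in components)

def surrogate_gradient(x):
    """Gradient of the sum is the sum of the component gradients,
    each scattered back into the coordinates it depends on."""
    x = np.asarray(x, dtype=float)
    g = np.zeros_like(x)
    for idx, w, _, grad in components:
        g[idx] += w * grad(x[idx])
    return g

def gradient_ascent(x0, step=0.1, n_steps=500):
    """Fixed-step gradient ascent on the surrogate."""
    x = np.array(x0, dtype=float)
    for _ in range(n_steps):
        x = x + step * surrogate_gradient(x)
    return x

x_opt = gradient_ascent(np.zeros(4))
```

Each gradient evaluation touches every component once, so in this sketch its cost grows with the number of components in the same way as evaluating the surrogate itself.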
We used the HAL surfaces to develop a reduced order surrogate model (ROSM) of the random forest; this ROSM is a linear combination of HAL surfaces derived from the feature interactions identified by the random forest. The resulting ROSM essentially has low local dimension because each component has maximum dimension 4. In practice, order 5 and 6 interactions were identified, but retaining them greatly decreased the computational efficiency of the HAL modelling and did not improve the model fidelity. Because the ROSM is a linear combination of low-dimensional surfaces, its gradient can also be described as a linear combination of the gradients of each surface. The ROSM can therefore be used in gradient ascent optimization at the same computational cost as evaluating the ROSM itself and is well-defined over the entire feature space. Maps of fertilizer application are derived for optimizing the soil concentrations of phosphorus and potassium.

Chapter 3 is a study using iterative Boltzmann inversion to generate a coarse-grain model of an all-atom simulation of ion interstratification in illite clay. Experimental results indicate that cesium ions can exchange with potassium ions in bulk layered silicates, which suggests that there is a mechanical or thermodynamic compensation for the incorporation of the larger cesium ion. Iterative Boltzmann inversion was used to incrementally update coarse-grain simulations of four clay layers by adjusting the bonded and non-bonded interaction strengths between coarse-grain centers, which represent oxygen atoms in the clay layers and the ions themselves. The model was able to reproduce results from smaller all-atom simulations indicating that the barrier to ion exchange is a function of interlayer spacing, which in turn depends on the identity of the ion in the interlayer.
By randomizing the position of ions in the interlayer between each coarse-grain simulation, the coarse-grain model was better able to sample the phase space and consequently was not subject to overfitting to a particular configuration of the ions. Most importantly, the coarse-grain model is able to run approximately 70 times faster than an all-atom simulation due to a roughly 2:1 reduction in the number of modelled particles. By eliminating explicit hydrogen atoms in the coarse-grain model, the time step could be increased by a factor of roughly 10.
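For readers unfamiliar with iterative Boltzmann inversion, the following minimal sketch shows the commonly used update for a tabulated non-bonded pair potential, V_new(r) = V_old(r) + alpha * k_B * T * ln(g_current(r) / g_target(r)). It is not the thesis code: the function names, the damping factor `alpha`, the `floor` cutoff, and the synthetic radial distribution functions are assumptions for illustration. In a real workflow, `g_current` would come from a coarse-grain simulation run with the current potential, which is not shown here.

```python
import numpy as np

K_B = 0.0083144626  # Boltzmann constant in kJ/(mol K)

def initial_potential(g_target, temperature, floor=1e-8):
    """Starting guess: potential of mean force, -k_B T ln g_target(r)."""
    return -K_B * temperature * np.log(np.maximum(g_target, floor))

def ibi_update(v_current, g_current, g_target, temperature,
               alpha=1.0, floor=1e-8):
    """One IBI step on a tabulated potential.

    Where the coarse-grain RDF exceeds the target, the potential is raised
    (made more repulsive); where it falls short, the potential is lowered.
    Bins where either RDF is effectively zero are left unchanged.
    """
    valid = (g_current > floor) & (g_target > floor)
    correction = np.zeros_like(v_current)
    correction[valid] = K_B * temperature * np.log(
        g_current[valid] / g_target[valid]
    )
    return v_current + alpha * correction

# Synthetic RDFs for illustration only (not data from the thesis).
r = np.linspace(0.2, 1.0, 81)                          # distance grid in nm
g_target = 1.0 + 0.5 * np.exp(-((r - 0.4) ** 2) / 0.01)
g_current = 1.0 + 0.8 * np.exp(-((r - 0.4) ** 2) / 0.01)  # peak too high

v0 = initial_potential(g_target, temperature=300.0)
v1 = ibi_update(v0, g_current, g_target, temperature=300.0)
```

In this toy case the coarse-grain peak is too high, so the update raises the potential around r = 0.4; once `g_current` matches `g_target`, the correction vanishes and the potential stops changing.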
