A systematic method for selecting molecular descriptors as features when training models for predicting physiochemical properties

Machine learning has proven to be a powerful tool for accelerating biofuel development. Although numerous models are available to predict a range of properties using chemical descriptors, there is a trade-off between interpretability and performance. Neural networks provide predictive models with high accuracy at the expense of some interpretability, while simpler models such as linear regression often lack accuracy. In addition to model architecture, feature selection is also critical for developing interpretable and accurate predictive models. We present a method for systematically selecting molecular descriptor features and developing interpretable machine learning models without sacrificing accuracy. Our method simplifies the process of selecting features by reducing feature multicollinearity and enables discoveries of new relationships between global properties and molecular descriptors. To demonstrate our approach, we developed models for predicting melting point, boiling point, flash point, yield sooting index, and net heat of combustion with the help of the Tree-based Pipeline Optimization Tool (TPOT). For training, we used publicly available experimental data for up to 8351 molecules. Our models accurately predict various molecular properties for organic molecules (mean absolute percent error (MAPE) ranges from 3.3% to 10.5%) and provide a set of features that are well correlated to the property. This method enables researchers to explore sets of features that significantly contribute to the prediction of the property, offering new scientific insights. To help accelerate early-stage biofuel research and development, we also integrated the data and models into an open-source, interactive web tool.


Introduction
Machine learning can leverage a breadth of experimental data in the public domain to accelerate fundamental and applied research for biofuel development. For example, machine learning models can predict relevant biofuel production pathways, as well as physical and chemical properties of potential biofuel molecules [1][2][3]. Preliminary fuel screening tools based on machine learning have already proven useful in programs worldwide and help identify promising molecules early in the fuel development cycle [1][2][3][4][5][6]. Many of these property prediction models also use chemical descriptors as features to provide a mathematical link between physiochemical properties and molecular structure [1][2][3][7][8][9][10][11][12][13][14][15][16][17][18][19][20][21][22][23][24]. Features in quantitative structure-property relationship models can provide new information about the predicted property and lead to a better understanding of the link between physical or chemical properties and molecular structure [32]. Many predictive models developed for identifying correlations between descriptors and properties of interest (i.e., interpretability) use multi-linear equations or least-squares regression [3,7,8,9,15,17,18,19,20,21]. These models are used because the linear coefficients directly correlate the descriptors' contributions to the prediction. However, they may lack the predictive performance of more complex models such as neural networks, especially when applied to systems that exhibit nonlinear relationships [3,9,18,19]. For example, Kessler et al. [3] compared the performance of multiple linear regression (MLR), artificial neural network, and graph neural network models for predicting yield sooting index. The MLR model provided insights into the structural components of a molecule that contribute to increasing yield sooting index, but had a mean absolute error (MAE) about seven times larger than the two neural network models.
In exchange for some interpretability, many researchers have used neural networks to develop fuel property models with high accuracy [2,3,9,10,11,14,16,18,19]. Although neural networks often perform better than other model architectures, they do not impose restrictions on input variables (i.e., features) or data relationships. This lack of restrictions can lead to combined features and hidden relationships that are not easily explained, making the models difficult to interpret from a mechanistic standpoint. For example, a neural network model may combine several molecular descriptors to create a new feature that minimizes error but does not directly correlate with a molecule's physical characteristics. Also, many neural networks do not show how correlations between features and the property are made during model development. As a result, a whole field of research has been dedicated to understanding their hidden relationships and model structure [33][34][35][36].
Feature selection is also critical for developing interpretable and accurate predictive models. It can reduce the risk of overfitting and identify important features with meaningful property relationships in the data [37]. Feature selection can also reduce multicollinearity in feature sets (i.e., features that are linearly correlated), which is often present in molecular descriptors and may promote unstable model coefficients [38]. Commonly used feature selection methods include statistical methods (e.g., using Pearson or Spearman correlation coefficients) or manually selected features based on experimental observations about the property of interest. Although these methods are effective at reducing features, they may overlook important features that offer new scientific insights about the property. For example, using statistical methods that ignore the property of interest may result in a set of non-correlated features irrelevant for that property. Manually selecting features could exclude important features that better characterize the data (e.g., distinguishing between isomers) [1,11,13,14,23,39]. Given the numerous methods for feature selection with varying drawbacks and constraints, it may be daunting to select one for developing a predictive model.
The purpose of this study is to develop a systematic method for creating accurate and interpretable fuel property prediction models that use molecular descriptors. Unlike previously published approaches, this method can be applied to a broad range of properties, from physical to complex, and can be used to support scientific discovery. Specifically, it enables researchers to identify, rank, and validate important property-structure relationships that may accelerate fuel development. The method focuses on reducing the number of features by minimizing correlations between chemical descriptors to develop high-performing models. It also ranks the features based on their importance, enabling researchers to identify dominant chemical-structure features that impact property values.
To demonstrate our approach, we created predictive models for five common jet fuel properties that are considered when certifying new jet fuels [40]: melting point, boiling point, flash point, yield sooting index, and net heat of combustion. We selected these properties because they range from physical to complex, with some having a clear relationship with molecular structure (e.g., heat of combustion, yield sooting index) [3,22] and others having ambiguous structure-property relationships [7,10]. We used automated machine learning (TPOT) to develop the models, which leverage chemical descriptors from Mordred [28] and experimental property data of organic molecules from publicly available sources. We validated model accuracy using test data withheld from training and published literature. To demonstrate the interpretability of our approach, we provide an in-depth discussion of the important features and how they correlate with the properties. The data and models have been integrated into a user-friendly, interactive web tool (feedstock-to-function.lbl.gov) and are publicly available to help accelerate early-stage biofuel development research [41,42]. Following this section, we provide a summary of previously published models. We then describe our methods for selecting an optimal number of features and developing interpretable machine learning models. Next, we present our results, draw conclusions, and provide recommendations for future extensions of this work.

Previous models
Several researchers have used molecular descriptors, molecular fragments, or functional groups as inputs (or features) for regression or classification models to predict molecular properties relevant to biofuels.
Table 1 provides a summary of their work. Classical machine learning is based on mathematical algorithms such as linear regression, decision trees, or support vector machines. Deep learning algorithms are based on neural networks, and include perceptrons, artificial neural networks, or adversarial networks. Bayesian methods include Gaussian process regression or Bayesian linear regression. Some studies reported only training errors or overall errors that averaged training, testing, and validation (when available) errors [1,2,12,15,39]. As noted by other researchers, reporting the test error provides a better measure of the model's predictive capability because external validation is necessary for determining the true predictive ability of the model [17].
In general, smaller or molecular family-specific (e.g., alcohols, alkanes, hydrocarbons, unsaturated hydrocarbons, and heterogeneous molecules) data sets result in lower errors for the boiling point, flash point, and yield sooting index models [11,12,16,17]. Melting point and heat of combustion models, however, have lower errors with the largest data sets [10,22]. For most of the properties, model type does not have an obvious impact on model accuracy. For example, most of the classical melting point models report a root mean squared error (RMSE) of about 40 K for their test data [7][8][9], with the neural network model having the largest RMSE (49 K) with the largest data set [10]. Boiling point shows a similar trend, with the smaller-data-set classical models having lower errors than the larger-data-set neural network models. Due to the lack of reported test errors, there is not enough information to determine how model type might impact flash point and yield sooting index model accuracy.
When comparing feature types, heat of combustion model performance was significantly better when functional groups (i.e., group contribution methods) were used as features instead of molecular descriptors [9,23]. Saldana et al. [9] developed 11 models using both descriptors and functional groups. To achieve the best performing model, they used a consensus modeling approach, where results from the best performing (lowest error) descriptor model and functional group model were averaged to produce the final prediction. For the remaining properties, there was not enough data to identify any additional trends with feature types.
Despite the significant advancements in developing models for predicting properties, a different feature selection method was used in almost every study. Approaches ranged from manually selecting features to using varied, and sometimes multiple, statistical methods. Approximately half of the prior studies we surveyed discussed the relationship of the model features to the property, and most of them used classical learning models [7,8,13,14,15,17,20,21,22]. This is likely because classical learning models can provide direct information on how features impact model predictions. For example, two classical models used to better understand how specific chemical features affect the melting point of potential drugs provided an extensive discussion of the relationship between individual descriptors and their effect on melting point [7,8]. When using a deep learning model, obtaining detailed information to understand the scientific relevance of the models may be more difficult [3].

Experimental property data
To train the property prediction models, we aggregated and coalesced experimental property data for organic molecules from publicly available, published sources [2,4,19,39,45,46,47,48,49,50,51]. Property data included flash point, boiling point, melting point, heat of combustion, and yield sooting index. While heat of combustion has been linked to molecular structure, properties such as flash point and yield sooting index also depend on combustion chemistry. The wide range of these properties illustrates the versatility of this approach and its potential to be applied to other properties. For flash point, we only included open-cup data (i.e., excluded the Pensky-Martens, Abel, Tag, and Setaflash methods of testing flash point) to ensure results were comparable with each other. Because our models focus on biofuel and bioproduct development, we only included organic molecules containing carbon, hydrogen, oxygen, and/or nitrogen and restricted molecules to those with 30 or fewer carbon atoms. If multiple sources reported experimental data for the same molecule, we compared measurements and either averaged or removed the data. Specifically, measurements within 15% or five units of each other were averaged, while measurements differing by more than 15% and five units were considered unreliable, and the molecule was removed. The final database comprises a variety of chemical classes, such as alkanes, cycloalkanes, alkenes, cycloalkenes, alkynes, alcohols, cycloalcohols, aldehydes, ketones, cyclic ketones, esters, ethers, and aromatics (see Table S2 in the supplementary material for more information). Compared to published literature, our dataset has considerably more (up to 30 times) alcohols, cycloalcohols, aldehydes, ketones, cyclic ketones, esters, ethers, carboxylic acids, and aromatics, and a comparable number of alkanes, cycloalkanes, alkenes, cycloalkenes, and alkynes [1].
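The reconciliation rule above can be sketched in a few lines. The function below is only an illustration of the stated 15%-or-five-units criterion, not the authors' actual implementation, and the function name is hypothetical.

```python
def reconcile(measurements):
    """Coalesce repeated measurements for one molecule: average values that
    agree within 15% or five units; otherwise flag the molecule for removal.
    A sketch of the rule described in the text, not the authors' exact code."""
    lo, hi = min(measurements), max(measurements)
    spread = hi - lo
    # Agreement if the spread is within five units OR within 15% of the magnitude.
    if spread <= 5 or (hi != 0 and spread / abs(hi) <= 0.15):
        return sum(measurements) / len(measurements)
    return None  # conflicting sources: drop the molecule from the database
```

For example, two measurements of 200 and 220 differ by 20 units but only about 9%, so they would be averaged; measurements of 10 and 30 exceed both thresholds, so the molecule would be dropped.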
Using Pandas [52], Scikit-learn [53], and RDKit [54], we retrieved simplified molecular input line entry specification (SMILES) strings for each molecule to index and merge data from different sources. Two challenges arise when indexing the database by SMILES. First, isomers need unique SMILES to be correctly identified. For this reason, we obtained isomeric SMILES for the molecules with isomers. Second, a single molecule can have multiple SMILES. To avoid duplicated molecules when merging data, we standardized the SMILES using the MolToSmiles() and MolFromSmiles() functions in RDKit.Chem. These functions map different SMILES belonging to the same molecule to a single molecule object in RDKit, then return a single standardized SMILES. If a published database did not index molecules by SMILES, such as Co-Optima [4], we used the names of the molecules to find SMILES via PubChem [46]. The complete database used for developing our models can be found at feedstock-to-function.lbl.gov.
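The merging step can be illustrated with a dictionary keyed by canonical SMILES. In the actual pipeline, canonicalization is performed with RDKit via Chem.MolToSmiles(Chem.MolFromSmiles(smiles)); in the sketch below a small hand-written lookup table stands in for RDKit so the example stays self-contained, and its entries are illustrative only.

```python
# Stand-in for RDKit canonicalization; in practice one would call
# Chem.MolToSmiles(Chem.MolFromSmiles(smiles)) to get the standardized form.
CANONICAL = {"OCC": "CCO", "CCO": "CCO", "C(C)O": "CCO"}  # ethanol variants

def merge_by_smiles(records):
    """Merge (smiles, value) records from different sources, collapsing
    alternative SMILES of the same molecule onto one canonical key so that
    repeated measurements land in the same bucket."""
    merged = {}
    for smiles, value in records:
        key = CANONICAL.get(smiles, smiles)
        merged.setdefault(key, []).append(value)
    return merged
```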

Feature selection
For our model features, we generated molecular descriptors using Mordred [28] because it is an open-source library that offers a wide variety of descriptors. Mordred includes more than 1800 features grouped into 50 categories called modules. Descriptors are generated from SMILES and can be two- or three-dimensional. After generating all descriptors for each molecule, we removed descriptors with non-numerical values (e.g., containing strings or NA values) and descriptors with matching values across most (>95%) molecules. We also removed the Autocorrelation descriptors because they may not directly correspond to structural or physical properties of the molecule [55].
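The filtering step can be sketched as follows. The function name and the dictionary-of-columns layout are hypothetical; in practice the descriptor table would come from Mordred via pandas.

```python
from collections import Counter

def filter_descriptors(table, max_identical=0.95):
    """Keep only informative numeric descriptors: drop columns containing
    non-numerical values and columns whose most common value covers more
    than `max_identical` of the molecules. `table` maps descriptor name to
    a list of values, one per molecule. An illustrative sketch."""
    kept = {}
    for name, values in table.items():
        if not all(isinstance(v, (int, float)) for v in values):
            continue  # strings, NA placeholders, etc.
        top_count = Counter(values).most_common(1)[0][1]
        if top_count / len(values) > max_identical:
            continue  # nearly constant across molecules: carries no signal
        kept[name] = values
    return kept
```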
When reducing the number of features, we used Recursive Feature Elimination (RFE) with an ensemble model estimator (Random Forest Regressor) because it was used in other studies, is easy to implement, and can outperform other approaches [2,56,57]. RFE is a supervised feature selection method. Fig. 1 shows a flowchart of our model development process. To effectively use RFE, we first removed correlations between descriptors by developing correlation matrices for each descriptor module using the Spearman coefficient [58]. Although the Pearson coefficient is more commonly used [14,19], the Spearman coefficient captures not only linear but all monotonic relationships. The Spearman coefficient is also appropriate for discrete ordinal and continuous variables, both of which are present in molecular descriptors.
For each descriptor in the module correlation matrix, we counted the number of descriptors with a correlation coefficient of at least 0.7. We then ranked the descriptors from highest to lowest by number of correlated module descriptors. Next, we selected the top descriptor as a feature descriptor and removed that feature descriptor and its correlated descriptors from the correlation matrix. Last, we regenerated the correlation matrix and repeated these steps until all descriptors in the module were either selected as a feature descriptor or deleted. If descriptors had the same number of correlated descriptors and therefore the same ranking, they were selected by their order in Mordred. We repeated this process for individual modules and then a final time after combining all feature descriptors. This process generated the feature descriptor set for RFE.
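The greedy pruning loop described above can be sketched as follows, assuming a precomputed Spearman correlation matrix. The function and data layout are illustrative, not the authors' code.

```python
def prune_correlated(corr, names, threshold=0.7):
    """Greedy multicollinearity reduction as described in the text: repeatedly
    keep the descriptor correlated with the most remaining descriptors, delete
    it together with its correlated partners, and recount. `corr` maps
    unordered name pairs to Spearman coefficients; `names` lists descriptors
    in Mordred order, which serves as the tie-breaker."""
    def partners(d, pool):
        return [o for o in pool if o != d and
                abs(corr.get((d, o), corr.get((o, d), 0.0))) >= threshold]
    selected, pool = [], list(names)
    while pool:
        # Highest partner count wins; ties resolved by position in `names`.
        best = max(pool, key=lambda d: (len(partners(d, pool)), -pool.index(d)))
        selected.append(best)
        removed = set(partners(best, pool)) | {best}
        pool = [d for d in pool if d not in removed]
    return selected
```

On a toy matrix where A correlates with B and C but D is uncorrelated, A is kept (removing B and C) and D survives as the second feature descriptor.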
To minimize bias in the model performance, we randomly divided each property data set into training (80%) and testing (20%) prior to using RFE.Fig. S2 and Table S1 in the Supplemental Information (SI) show the distribution of training and testing data and the number of molecule types (e.g., hydrocarbons, oxygenated hydrocarbons) used for each model.
Next, we implemented RFE using scikit-learn [53], which iteratively groups and removes the 10 least important descriptors until no descriptors remain. The importances of the descriptors are derived by fitting a random forest estimator at each step and looking at the accumulation of the impurity decrease within each tree [59]. RFE then ranks the descriptor groups based on their removal order. For example, the last remaining descriptors will be ranked 1, while the second-to-last remaining descriptors will be ranked 2, and so forth. This method has been shown to be robust and useful for identifying strong predictors in low-dimensional data [60]. Upon changing the random seed 10 times for heat of combustion, 7 out of the top 10 features were ranked 1st all 10 times, while the other three were ranked either 1st or 2nd, showing that the method produces consistent results when randomizing the estimator.
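The removal-order ranking can be illustrated with a simplified loop. Here the importance score is a fixed stub passed in by the caller, whereas scikit-learn's RFE refits the random forest estimator at every elimination step; the function name is hypothetical.

```python
def rfe_ranks(features, importance, step=10):
    """Assign RFE-style ranks by removal order: at each step the `step` least
    important surviving features are dropped; the final survivors get rank 1,
    the group dropped just before them rank 2, and so on. `importance` is a
    stand-in for the refitted estimator's impurity-based importances."""
    groups, remaining = [], list(features)
    while remaining:
        remaining.sort(key=importance, reverse=True)  # most important first
        dropped = remaining[-step:] if len(remaining) > step else remaining
        groups.append(dropped)
        remaining = remaining[:len(remaining) - len(dropped)]
    ranks = {}
    for rank, group in enumerate(reversed(groups), start=1):
        for feature in group:
            ranks[feature] = rank
    return ranks
```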
We then created multiple feature sets, with the first feature set including only rank 1 descriptors, by sequentially adding higher-ranked descriptor groups until the last feature set included descriptors of all ranks. We then trained TPOT on all the generated feature sets.
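Building the nested feature sets from the ranks is then straightforward; the helper name below is hypothetical.

```python
def cumulative_feature_sets(ranks):
    """From {feature: rank}, build the nested sets used for model training:
    the first set holds only rank-1 descriptors, each later set adds the
    next rank, and the last set contains descriptors of all ranks."""
    by_rank = {}
    for feature, rank in ranks.items():
        by_rank.setdefault(rank, []).append(feature)
    sets, current = [], []
    for rank in sorted(by_rank):
        current = current + sorted(by_rank[rank])
        sets.append(list(current))
    return sets
```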
To evaluate the robustness of this approach, we varied the random seed ten times, retrained the models, and observed the changes in feature importances. We found that varying the random seed changed each feature importance by no more than 20%, with the largest change (from 0.48 to 0.6) occurring in the yield sooting index model (see Fig. S1 in the SI for more information).

Model development using automated machine learning
TPOT is a tree-based automated machine learning tool that finds the best model for a given data set by exploring thousands of model pipelines [29,30]. For our models, we set TPOT to optimize only ensemble models because they can provide prediction intervals with quantile regression forests or quantile extra trees. This allowed us to estimate conditional quantiles and evaluate the reliability of the prediction for each molecule [61]. When training the models, TPOT performed fivefold cross-validation with each fold comprising 20% of the training set. TPOT performs the cross-validation internally, using the average accuracy of all five folds to select the highest performing model architecture. Then, it trains the highest performing model architecture with the entire training set and returns the final model [29,30]. For all feature sets described in Section 3.2, we used TPOT to develop models and then compared model performances. Like previous studies [1,3,9], we selected the final predictive model for the property, and its corresponding feature set, with the lowest cross-validation error. Fig. 1 provides a graphical representation of the feature and model selection process.
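The selection loop TPOT performs internally can be mimicked in miniature. The sketch below is not TPOT's implementation, only an illustration of cross-validated architecture selection followed by a refit on the full training set; all names are hypothetical.

```python
import statistics

def kfold(n, k=5):
    """Contiguous index folds (TPOT/scikit-learn use shuffled splits)."""
    size = n // k
    return [list(range(i * size, (i + 1) * size if i < k - 1 else n))
            for i in range(k)]

def select_model(candidates, xs, ys, k=5):
    """Score each candidate by mean k-fold MAE, pick the best architecture,
    then retrain the winner on the entire training set, as TPOT does.
    `candidates` maps a name to fit(xs, ys) returning a predict function."""
    folds = kfold(len(xs), k)
    def cv_error(fit):
        errors = []
        for fold in folds:
            train = [i for i in range(len(xs)) if i not in fold]
            predict = fit([xs[i] for i in train], [ys[i] for i in train])
            errors.append(statistics.mean(abs(predict(xs[i]) - ys[i])
                                          for i in fold))
        return statistics.mean(errors)
    best = min(candidates, key=lambda name: cv_error(candidates[name]))
    return best, candidates[best](xs, ys)  # final refit on all training data
```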

Model performance metrics
To evaluate the prediction performance and accuracy of each model, we calculated the RMSE, MAE, MAPE, and median absolute error (MedAE). These metrics do not indicate whether the model predictions over- or under-estimate experimental data. Instead, they measure model performance and accuracy. They can be used on the train, test, or validation sets, but the performance on the test or validation sets should be used to estimate the model's performance on unseen data.
The RMSE is frequently used to estimate model error:

$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}$$

where $n$ is the number of samples (or molecules), $y_i$ are the dependent variables, with $i = 1, \ldots, n$, and $\hat{y}_i$ is the predicted value of $y_i$. Due to its quadratic nature, RMSE is especially sensitive to large errors and outliers. MAE is also a popular performance metric because it is intuitive and easy to interpret:

$$\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|y_i - \hat{y}_i\right|$$

MAE is less sensitive to large errors and outliers than RMSE, which may be desirable if the data span a large range. MAPE is similar to MAE and provides a dimensionless (i.e., unit-less) measure of model accuracy:

$$\mathrm{MAPE} = \frac{100\%}{n}\sum_{i=1}^{n}\left|\frac{y_i - \hat{y}_i}{y_i}\right|$$

However, MAPE penalizes errors relative to the magnitude of the expected value, so an error that is modest in absolute terms may still be large in percentage terms. This metric places more penalty on errors at the lower end of the target range. For example, MAPE will show that an error of 20 units when the expected value is 5 is far less accurate than an error of 20 units when the expected value is 500. Because MedAE is the median of all the absolute errors, it is robust to outliers and provides a measure of model performance for the majority of the data (i.e., excluding outliers):

$$\mathrm{MedAE} = \mathrm{median}\left(\lvert y_1 - \hat{y}_1\rvert, \ldots, \lvert y_n - \hat{y}_n\rvert\right)$$

where $\mathrm{median}(\cdot)$ denotes the median of the set of absolute errors.
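In code, the four metrics reduce to a few lines of standard-library Python (the helper names are illustrative):

```python
import math
from statistics import median

def rmse(y, y_hat):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(y, y_hat)) / len(y))

def mae(y, y_hat):
    return sum(abs(a - b) for a, b in zip(y, y_hat)) / len(y)

def mape(y, y_hat):
    # Percent error relative to each experimental value; assumes no y is zero.
    return 100.0 * sum(abs((a - b) / a) for a, b in zip(y, y_hat)) / len(y)

def medae(y, y_hat):
    return median(abs(a - b) for a, b in zip(y, y_hat))
```

On absolute errors of (0, 0, 3) against true values (1, 2, 4), MAE is 1.0 while MedAE is 0.0, illustrating MedAE's insensitivity to a single outlier.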

Results and discussion
To develop accurate and interpretable machine learning models, we determined the optimal descriptor sets and created predictive models for five physiochemical properties using the method described in Section 3. We developed models for melting point, boiling point, flash point, yield sooting index, and heat of combustion. A summary of the models and their data performance (training and testing) is shown in Table 2.
Interestingly, the optimal model architecture identified by TPOT for all properties was ExtraTrees. This algorithm creates a large number of regression trees, each trained on a random subset of features, then averages the prediction of each tree to come up with a final prediction. It has been shown to provide near-optimal accuracy and good computational complexity [62]. In addition, variable importances derived from ensemble models such as ExtraTrees can properly assess the relevance of a variable [59].
In general, our predictive models demonstrate comparable performance to previously published models. Our melting point model uses two to eight times more data than previous studies and has test errors comparable to Saldana et al. [9] (about 4 K higher MAE and RMSE) and lower than Karthikeyan et al. [10] (about 10 K lower MAE and RMSE). Our boiling point and flash point models have errors similar to previous models trained on large (greater than 1000 molecules) and diverse data sets (i.e., models trained using multiple classes of organic compounds) [16,18,19]. The yield sooting index model also has errors that match previous studies, but uses only a fraction of the features (15 for our model versus 390 or 1800 for others), which may help prevent overfitting [2,3]. Although the heat of combustion model performed better than linear models that used chemical descriptors as features (MAE of 91.9 kJ/mol versus 104.1 kJ/mol) [21,22], it did not perform as well as models that used functional group counts as inputs [9,23]. For heat of combustion, using a group contribution approach rather than molecular descriptors as model inputs generally produces better model performance [63,64]. This is likely because fuel enthalpy, which is related to heat of combustion, has been shown to correlate with functional groups. As shown in Table 1, this may not hold for other physiochemical properties.
In some cases it was difficult to compare the performance of our models to the published literature because some studies only reported overall errors that averaged both training and testing errors (i.e., errors for the testing data were not provided) [1,12,15,17,18,20]. Reporting model performance using the average of training and testing errors may be misleading because the model is likely to have lower errors with training data than testing data. Therefore, the overall error may overestimate the predictive capabilities of the model, especially with new data. As discussed by other researchers, external validation is an indispensable validation method for determining the true predictive ability of the model, and reporting the test error will provide a better measure of the model's predictive capability [17,65,66].
When evaluating model performance for different molecular families (e.g., hydrocarbons, oxygenated hydrocarbons, organic nitrogen molecules), Fig. 2 shows that most models predict hydrocarbon properties better than those of nitrogen- and oxygen-containing compounds. Hydrocarbons may be easier to predict because their properties tend to correlate highly with bond structure and intermolecular connectivity [67]. For nitrogen- and oxygen-containing compounds, additional intermolecular forces (such as dipole-dipole moments) and degrees of freedom may influence properties in ways that are difficult to capture with available molecular descriptors. Yield sooting index is the only model with higher errors for hydrocarbons than for other molecule families; however, the yield sooting index range for hydrocarbons is more than 10 times larger than that of oxygenated hydrocarbons and organic nitrogen molecules (see Tables S1 and S3 in the SI). The remaining models have the largest error among organic nitrogen molecules. For information about RMSE for these groups, see Table S4.
The following sections discuss in detail the chemical descriptors used for each model and interpret their relationships to the properties. Additionally, the SI contains a full list of descriptors, their definitions, and their feature importance values for each model.

Melting point model
To interpret the melting point model features (i.e., descriptors) and understand which characteristics are captured by the model, Fig. 3 shows the feature importance of the descriptors grouped by module. Melting point depends on molecular properties and the strength of the crystalline lattice. These are functions of molecular packing in the crystals (e.g., shape, size, symmetry) and intermolecular forces such as charge transfer and dipole-dipole interactions in the solid phase [10,68,69,70]. Melting point also depends on many entropic parameters, such as oligomerization and other self-organization processes, that may not be captured by chemical descriptors [7,8,10,68,69].
The highest ranked descriptor in the melting point model is piPC3, defined as the 3-ordered pi-path count (log scale) [28]. piPC3 is the only descriptor used from the PathCount module and captures aspects of the molecular structure. Specifically, it measures the amount of branching in the bonded structure of the molecule, where aromatic bonds are weighted more heavily than single bonds [71]. As shown in Fig. S3, melting point tends to increase as piPC3 increases. Additionally, at a fixed value of piPC3, oxygenated hydrocarbons and organic nitrogen compounds have higher melting points than their hydrocarbon counterparts.
The second highest ranked descriptor is MZ, the only descriptor used from the Constitutional module. MZ is defined as the constitutional mean and does not contain geometric information. It is determined by calculating the mean atomic number of the molecule and then normalizing the mean by the atomic number of carbon (i.e., 6) [28]. In general, melting point increases with MZ (see Fig. S4).
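The MZ definition is easy to reproduce by hand. The helper below is illustrative (not Mordred's code) and assumes hydrogens are counted among the atoms; Mordred's exact convention should be checked against its documentation.

```python
def constitutional_mean_z(atomic_numbers):
    """MZ as defined in the text: the mean atomic number over the molecule's
    atoms, normalized by carbon's atomic number (6)."""
    return (sum(atomic_numbers) / len(atomic_numbers)) / 6.0

# Ethanol (C2H6O): two carbons (Z=6), six hydrogens (Z=1), one oxygen (Z=8).
ethanol = [6, 6, 1, 1, 1, 1, 1, 1, 8]
# Hexane (C6H14): a pure hydrocarbon of similar size for comparison.
hexane = [6] * 6 + [1] * 14
```

Under these assumptions the oxygenate scores higher (about 0.481 versus 0.417 for hexane), consistent with the trend that nitrogen- and oxygen-containing molecules have greater MZ than comparable hydrocarbons.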
An organic molecule with a nitrogen or oxygen atom has a greater MZ value than a hydrocarbon with only carbon and hydrogen atoms because nitrogen and oxygen have greater atomic numbers than carbon. The dipole-dipole interactions between either the lone pairs in nitrogen or oxygen and hydrogen are stronger than the dipole-dipole interactions between carbon and hydrogen. This effect is due to nitrogen and oxygen having greater electronegativity than carbon, which results in stronger intermolecular forces that require more energy to break (i.e., a greater melting point). This could explain why MZ is an important predictor of melting point in our model.
The module with the largest aggregated feature importance is MOEType, which comprises 34 descriptors, including those that characterize intermolecular interactions. The MOEType module collects two-dimensional descriptors from the Molecular Operating Environment software, based on precalculated surface area values (or VSA) derived from a list of functional groups that approximate van der Waals surface area [72,73]. MOEType descriptors capture many fundamental characteristics needed to predict melting point, including hydrophobic and hydrophilic effects, polarizability, electrostatic interactions, and steric effects [7,10,72,74,75].
Most of the remaining descriptors and modules in our melting point model capture additional features that characterize molecular shape, atoms, bond types, functional groups, and polarizability. For example, topological descriptors count numbers of atoms, bonds, branching, and electrons, and can be used to characterize polarizability, dipole moment, and some steric effects [76]. Polarizability and induced dipole interactions influence intramolecular interactions that are important determinants of melting point, especially for larger molecules with electrons that are easier to polarize [8,10].
A key difference between our melting point model and other published models is that it can distinguish between isomers because it includes three-dimensional descriptors. Only two three-dimensional descriptors are used in the melting point model: Plane of Best Fit (PBF) and GeometricalShapeIndex. Interestingly, these descriptors have a low feature-importance ranking. Previous research has suggested that melting point models without three-dimensional descriptors perform better than those with both two- and three-dimensional descriptors [10]. Here, we found that the addition of three-dimensional descriptors improved accuracy.
Given the diversity of the data used to train the model, descriptors common to many molecules may be more heavily weighted than other descriptors, resulting in larger prediction errors. For example, more than half of the molecules used for training contain a nitrogen atom, and the third highest ranked descriptor is the number of nitrogen atoms (nN) in the AtomCount module. As another example, the model predicts that 1,2-epoxyhexane (C6H12O) has a low melting point (171 K) like 1,3-epoxybutane (C4H8O), but its actual melting point is more than double (367 K). This indicates that some of the highest-ranked modules, such as Estates, RingCount, and AtomCount, which characterize the types of atoms and fragments or functional groups present in the molecules, are missing some features, resulting in larger prediction errors. Specifically, the descriptors may not accurately characterize molecules with stronger intermolecular interactions in the solid state due to oligomerization or other self-organization processes. Other researchers have observed a similar phenomenon when using descriptors to predict melting point, and indicate that available descriptors may not be sufficient to accurately predict the melting point of molecules with strong, solid-state intermolecular interactions [7,10,68,69].

Boiling point model
To understand which characteristics are captured by the boiling point model, we compared the importance of individual descriptors (see Fig. 4). Boiling point depends on physical characteristics such as branching in the molecular structure, different measures of intermolecular connectivity, and the dipole-dipole interactions within the molecule [13,14,17]. Therefore, it is not surprising that piPC1 is the highest ranked descriptor for the boiling point model, since it measures the amount of branching in the molecular structure. piPC1 is defined as the 1-order pi-path count (log scale) and is the only descriptor used from the PathCount module [28]. As Fig. S5 shows, boiling point generally increases with increasing path count values, which agrees with previous research by Dai et al. [17]. The second-highest ranked descriptor is VE2_A, from the AdjacencyMatrix module, which describes intermolecular connectivity and vertex centrality [14,17,28]. Similar to previous studies, Fig. S6 shows that boiling point is inversely proportional to VE2_A values [14,17]. The nHBDon descriptor, which represents the number of hydrogen bond donors in the molecule, is ranked fourth-most important in our boiling point model.
The MoRSE module was ranked second highest in the boiling point model, after PathCount. This module of descriptors was developed to encode the three-dimensional structure of a molecule without Cartesian or internal coordinates. The calculation involves the Euclidean distance between atoms, the total number of atoms, and different atomic properties [77]. When unweighted, the descriptors in this module are simply functions of the number of atoms in the molecule. These descriptors can also be weighted by atomic properties such as mass, van der Waals volume, or Sanderson electronegativity, which highlight or diminish the role of certain atoms in the molecule. In the boiling point prediction model, the descriptors used are unweighted and weighted by mass, van der Waals volume, and polarizability. MoRSE three-dimensional descriptors have been used for predicting boiling point, specifically weighted by van der Waals volume [14].
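The 3D-MoRSE calculation described above can be sketched in a few lines. The sketch below assumes the common textbook form I(s) = Σ_{i<j} w_i w_j sin(s·r_ij)/(s·r_ij), summed over interatomic distances r_ij; the function name and defaults are illustrative, not the Mordred implementation.

```python
import math

def morse_signal(coords, weights=None, s_max=8):
    """Sketch of the 3D-MoRSE signal for one conformer:
    I(s) = sum_{i<j} w_i * w_j * sin(s * r_ij) / (s * r_ij), s = 0..s_max,
    where r_ij is the Euclidean distance between atoms i and j.
    I(0) uses the limit sin(x)/x -> 1. Unweighted means w_i = 1 for all atoms;
    weighting by mass, volume, or polarizability just changes `weights`."""
    n = len(coords)
    w = weights or [1.0] * n
    # precompute weighted pairwise distances
    pairs = [(w[i] * w[j], math.dist(coords[i], coords[j]))
             for i in range(n) for j in range(i + 1, n)]
    signal = []
    for s in range(s_max + 1):
        total = 0.0
        for wij, r in pairs:
            x = s * r
            total += wij * (1.0 if x == 0 else math.sin(x) / x)
        signal.append(total)
    return signal
```

For a two-atom "molecule" 1 Å apart, I(0) is exactly 1 and I(s) decays as sin(s)/s, which illustrates why the unweighted descriptors reduce to functions of atom count and geometry alone.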

Flash point model
Fig. 5 shows the importance of features used in the flash point model. Flash point correlates well with boiling point, and some studies have even used one property to predict the other [78,79]. Therefore, it is not surprising that the top descriptors used in the flash point model, as well as their relative importances, mimic those in the boiling point model. Specifically, the top-ranked descriptors are TpiPC10 in the PathCount module and VE2_A in the AdjacencyMatrix module. Figs. S7 and S8 show that VE2_A may correlate better with hydrocarbons than TpiPC10. While the PathCount module has been successfully used in previous flash point models, the AdjacencyMatrix module has not [18]. However, similar descriptors that capture the number of bonds and bond strength of a molecule, such as edge adjacency indices, have been used in some models [20].
As in the boiling point model, the MoRSE module has the highest aggregated importance in the flash point model and captures the three-dimensional structure of the molecules. The nHBDon descriptor also captures similar effects, as increasing the number of hydrogen bond donors increases flash point, and has been used in previous models [23].

Yield sooting index model
Fig. 6 shows the importance of features used in the yield sooting index model. Aromaticity in molecules correlates with higher sooting tendencies (i.e., higher yield sooting indices) [39,80]. As expected, the highest ranked descriptor, which has an importance greater than all other descriptors combined, is nAromBond in the Aromatic module. This descriptor counts the number of aromatic bonds [28]. As noted by previous researchers, molecules with aromatic atoms have significantly higher yield sooting index than their non-aromatic counterparts [39]. Our model shows a similar trend, with yield sooting index generally increasing with the number of aromatic bonds (see Fig. S9).
Although nAromBond is an important feature in the yield sooting index model, it does not capture the structure of non-aromatic portions of the molecule or information about non-aromatic molecules. As such, the remaining descriptors in the model capture non-aromatic features. For example, the second-highest ranked descriptor is ABC, defined as the atom-bond connectivity index, which measures branching [28]. This is the most important descriptor in the heat of combustion model, and details about this descriptor are included in Section 4.5.
Additionally, four descriptors come from the BCUT module, which is calculated from the Burden matrix (a vertex- and edge-weighted adjacency matrix) [81]. One of these descriptors, which measures the highest eigenvalue of the Burden matrix weighted by polarizability, is also used in the yield sooting index model developed by Kessler et al. [3].

Heat of combustion model
Fig. 7 shows the importance of features used in the heat of combustion model. The highest ranked descriptor in the heat of combustion model is the ABC descriptor, with a feature importance of almost 0.5. The ABC descriptor is the Atom-bond Connectivity Index, or ABC Index, developed as an improved version of the Randić Connectivity Index, which measures branching in saturated hydrocarbons [67,82]. The ABC Index is a degree-based molecular structure descriptor used to model thermodynamic properties of organic chemical molecules and advance nanochemical applications [83]. The ABC Index correlates with the stability of linear and branched alkanes as well as the strain energy of cycloalkanes. It also correlates with the heat of formation of alkanes and cycloalkanes, and can predict their thermodynamic properties [67,84,85].
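The ABC Index has a compact definition on the hydrogen-suppressed molecular graph: for each bond (u, v), add sqrt((d_u + d_v − 2)/(d_u · d_v)), where d is the vertex degree. A minimal sketch of this definition (illustrative, not the Mordred implementation) shows how the index distinguishes degree of branching:

```python
import math

def abc_index(edges):
    """Atom-bond connectivity (ABC) index of a hydrogen-suppressed
    molecular graph given as a list of bonds (u, v):
    ABC = sum over edges of sqrt((d_u + d_v - 2) / (d_u * d_v)),
    where d_u is the degree (number of bonds) of vertex u."""
    degree = {}
    for u, v in edges:
        degree[u] = degree.get(u, 0) + 1
        degree[v] = degree.get(v, 0) + 1
    return sum(
        math.sqrt((degree[u] + degree[v] - 2) / (degree[u] * degree[v]))
        for u, v in edges
    )

# Carbon skeletons with the same number of C-C bonds but different branching:
n_butane = [(0, 1), (1, 2), (2, 3)]   # linear chain
isobutane = [(0, 1), (0, 2), (0, 3)]  # branched (central carbon of degree 3)
```

Both skeletons have three bonds, yet the index differs (3/√2 ≈ 2.12 for n-butane versus √6 ≈ 2.45 for isobutane), which is the branching sensitivity the model exploits.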
As shown in Fig. S10, ABC correlates well with heat of combustion, and clearly correlates with alkanes and cycloalkanes, agreeing with previous studies [67,84]. Almost 50% of our data comprise hydrocarbons (1164 out of 2490 molecules), with 25% being alkanes and cycloalkanes (498 alkanes and 121 cycloalkanes). This may partially explain the high importance of the ABC descriptor in the model.
The second-highest ranked descriptor is ETA_dBeta from the ETA module. The valence electron mobile (VEM) count, β, sums contributions from sigma bonds, pi bonds, and a term that accounts for resonating lone-pair electrons in an aromatic system [83]. ETA_dBeta is defined as the difference between the contribution from non-sigma bonds (i.e., pi bonds and resonating lone pairs) and sigma bonds [28]. From Fig. S11, we can see that this descriptor isolates organic nitrogen molecules better than ABC does. This suggests that ETA_dBeta may encode structural features that better characterize organic nitrogen molecules.

Conclusions
This research establishes a comprehensive method for developing interpretable models that predict multiple molecular properties. The method can be applied to a broad range of properties, from physical to complex, and can help researchers identify, rank, and validate important property-structure relationships that may accelerate biofuel development. The method focuses on reducing the number of features by minimizing correlations between chemical descriptors to develop high-performing models. It also ranks the features based on their importance, enabling researchers to identify dominant chemical-structure features that impact property values.
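The correlation-reduction idea can be illustrated with a simplified greedy filter: keep a descriptor only if its pairwise correlation with every already-kept descriptor stays below a threshold. This is a sketch of the general technique, not the exact pipeline used in this work (which also employs TPOT for model optimization); the function names and threshold are illustrative.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length samples."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def prune_correlated(features, threshold=0.95):
    """Greedy multicollinearity filter: iterate over descriptors in order
    and keep one only if |r| with every previously kept descriptor is
    below the threshold. `features` maps descriptor name -> value list."""
    kept = []
    for name in features:
        if all(abs(pearson(features[name], features[k])) < threshold
               for k in kept):
            kept.append(name)
    return kept
```

Because the filter is order-dependent, which member of a correlated pair survives depends on iteration order; ranking descriptors (e.g., by importance) before filtering is one way to make the choice deliberate.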
To demonstrate our method, we developed molecular property prediction models for five common jet fuel properties: melting point, boiling point, flash point, yield sooting index, and net heat of combustion. The properties range from physical to complex, having known and unknown relationships with molecular structure or combustion chemistry. Data used to train the models contain organic molecules (specifically, carbon atoms attached to hydrogen, oxygen, and/or nitrogen atoms) with fewer than 30 carbon atoms. The numbers of molecules used to develop the models range from 481 molecules (yield sooting index) to 8351 molecules (melting point).
The MAPE for the models, based on the test data, ranges from 3.3% (boiling point) to 10.5% (yield sooting index), performing similarly to previously published models. A key advantage of these models is that they enable users to directly explore the relationships between properties and the importances of individual descriptors or modules used by the model. For example, we found that the Atom-bond Connectivity Index, a measure of molecular branching, predicts heat of combustion well, especially for alkanes and cycloalkanes. We also observed that the number of aromatic bonds is a good predictor of yield sooting index, agreeing with previous research. The consistency of the method was tested by randomizing different parts of the algorithm and comparing the results, which are consistent and highlight the same features each time.
Overall, our method provides a consistent and robust approach for developing physiochemical property-prediction models. To accelerate early biofuel research and development, we integrated the data and models into a user-friendly, interactive web tool that is publicly archived and can be found at feedstock-to-function.lbl.gov [41,42]. From this research, we recommend that future models report test errors in addition to overall errors that combine testing and training errors. This will improve transparency of model performance, especially with new or unseen data. Additionally, this method could be used to develop models for other physiochemical properties (e.g., viscosity, density) or to reduce multicollinearity in other highly correlated feature sets similar to chemical descriptors (e.g., gene expression datasets) when developing models.

Declaration of competing interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Fig. 1 .
Fig. 1.Overview of feature selection and model optimization pipeline.

Fig. 2 .
Fig. 2. Parity plots showing test set experimental and predicted values of (a) Melting Point, (b) Boiling Point, (c) Flash Point, (d) Yield Sooting Index, and (e) Heat of Combustion models for hydrocarbons, oxygenated hydrocarbons, and all other organic molecules.

Table 1
Published quantitative structure property relationship models for predicting melting point, boiling point, flash point, yield sooting index, and heat of combustion.

Table 2
Property prediction molecules, features, and model test performance using Mean Absolute Error (MAE), Root Mean Square Error (RMSE), Median Absolute Error (MedianAE), and Mean Absolute Percentage Error (MAPE) for test data.
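The metrics listed in Table 2 follow their standard textbook definitions; a minimal sketch (the function name is illustrative):

```python
import math
import statistics

def regression_metrics(y_true, y_pred):
    """Standard error metrics on held-out test data:
    MAE (mean absolute error), RMSE (root mean square error),
    MedianAE (median absolute error), and MAPE (mean absolute
    percentage error, in percent; assumes no y_true value is zero)."""
    errs = [abs(t - p) for t, p in zip(y_true, y_pred)]
    n = len(errs)
    return {
        "MAE": sum(errs) / n,
        "RMSE": math.sqrt(sum(e * e for e in errs) / n),
        "MedianAE": statistics.median(errs),
        "MAPE": 100.0 * sum(e / abs(t) for e, t in zip(errs, y_true)) / n,
    }
```

MAPE is scale-free, which is why it allows comparison across properties measured in different units (kelvin, MJ/kg, or the unitless yield sooting index), whereas MAE and RMSE carry the property's units.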