Enthalpy of Solvation

Researchers: Brandon J Jaquis, Andrew SID Lang, Ailin Li, Nolan D Monnier
License: CC0

Data

We extracted the experimentally measured enthalpies of solvation in ethanol for 144 compounds from the Acree Enthalpy of Solvation Dataset [1]. Ferrocene was not considered because it is an organometallic complex.

Descriptors

We used Rajarshi Guha's CDK Descriptor Calculator (v 1.4.8) to calculate the CDK descriptors for all the compounds in our data file. We selected the option to add explicit hydrogens and exported the results as a .csv file: CDKDescriptors.csv.

Once the descriptors were calculated, we deleted all columns with zero standard deviation, including columns consisting entirely of "NA" values, using the following code.

setwd("C://..")
mydata = read.csv(file="CDKDescriptors.csv",header=TRUE,row.names="molID")
ncol(mydata) #get number of columns
[output]
287
[output]
mydata <- mydata[,colSums(is.na(mydata))<nrow(mydata)] ##remove columns of all entries "NA"
ncol(mydata)
[output]
222
[output]
write.csv(mydata,file="CDKDescriptorsNoNA.csv")

Here's the resulting file: CDKDescriptorsNoNA.csv. We then used the caret package to remove near-zero-variance columns, i.e. columns whose values are all (or almost all) the same.

library(caret) ## for feature selection
nzv <- nearZeroVar(mydata) ## indices of zero- and near-zero-variance columns
mydata <- mydata[, -nzv] ## drop them
ncol(mydata)
[output]
120
[output]
write.csv(mydata,file="DescriptorsNoNZV.csv")

Here's the resulting file: DescriptorsNoNZV.csv. Opening it, we see that there are still a few "NA" values. Several rows (molID 57-62: Helium, Neon, Argon, Xenon, Radon, Hydrogen) were removed due to multiple "NA" values, as were several columns with multiple "NA" values (HybRatio, Kier3, topoShape). This leaves us with a set of 138 deltaH values and 116 CDK descriptors: DescriptorsNoNZVCleaned.csv.

Feature Selection

A correlation matrix was built and used to eliminate highly correlated columns. Of each highly correlated pair, we removed the column that was least unique with respect to the remaining columns. Here is the resulting file: SolventInformationFeatureSelected.csv
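One way to sketch this step is with caret's findCorrelation; both the function choice and the 0.90 cutoff are assumptions for illustration, not necessarily what produced the final file.

```r
library(caret)

## Assumes 'mydata' is the cleaned frame (deltaH plus 116 descriptors);
## the response column is excluded before computing correlations.
desc <- mydata[, setdiff(names(mydata), "deltaH")]
corMatrix <- cor(desc)                               # pairwise correlations
highCor <- findCorrelation(corMatrix, cutoff = 0.90) # assumed cutoff
## findCorrelation flags, from each highly correlated pair, the column
## with the larger mean absolute correlation to the others -- i.e. the
## one that is least unique relative to the remaining columns.
if (length(highCor) > 0) desc <- desc[, -highCor]
mydata <- cbind(deltaH = mydata$deltaH, desc)
write.csv(mydata, file = "SolventInformationFeatureSelected.csv")
```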

Chemical Space

As in the previous year, the data contains 138 molecules with 86 descriptors, showing a high degree of consistency between the two years' chemical spaces. Accordingly, the histogram of molecular weights is also the same:

To better visualize the chemical space, the data set was projected using PCA. We also clustered the data to get a better understanding of its spread.

The following picture shows the k-means-clustered data points plotted over PC1 & PC2 and PC1 & PC3. Unlike in the past year, we scaled the data before selecting the number of clusters, resulting in one fewer cluster. The algorithm separated a main group (containing 84.1% of the data) from a secondary outlier group. Here is the resulting file: pca.csv
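A minimal sketch of the projection and clustering with base R's prcomp and kmeans; k = 2 is an assumption matching the two groups described above, and 'fit' is the name the k-means object is referred to by in the prediction code further down.

```r
## Assumes 'mydata' is the feature-selected frame (deltaH + 86 descriptors).
desc <- mydata[, setdiff(names(mydata), "deltaH")]
pca <- prcomp(desc, scale. = TRUE)      # PCA on standardized descriptors
fit <- kmeans(scale(desc), centers = 2) # assumed k = 2, per the two groups above
## Clusters over PC1 & PC2 (repeat with pca$x[, 3] for PC1 & PC3)
plot(pca$x[, 1], pca$x[, 2], col = fit$cluster, pch = 19,
     xlab = "PC1", ylab = "PC2", main = "k-means clusters in PCA space")
write.csv(pca$x, file = "pca.csv")
```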

Modeling

This is the first model we created. No particular options were set for the randomForest function. Our training sample size was 75% of the data, with a seed of 123.

library("randomForest")
mydata = read.csv(file="SolventInformationFeatureSelected.csv",header=TRUE,row.names="molID")
## 75% of the sample size
smp_size <- floor(0.75 * nrow(mydata))
## set the seed to make the partition reproducible
set.seed(123)
train_ind <- sample(seq_len(nrow(mydata)), size = smp_size)
train <- mydata[train_ind, ]
test <- mydata[-train_ind, ]
mydata.rf <- randomForest(deltaH ~ ., data = train, importance = TRUE)
print(mydata.rf)
[output]
Call:
randomForest(formula = deltaH ~ ., data = train, importance = TRUE)
Type of random forest: regression
Number of trees: 500
No. of variables tried at each split: 28
Mean of squared residuals: 88.37216
% Var explained: 83.95
[output]
saveRDS(mydata.rf, file = "EnthalpyTraining Set") #Stores File
varImpPlot(mydata.rf, sort = T, main="Variable Importance")#Displays Importance Plot

The actual R model is contained in this file: EnthalpyTraining Set. Here is the Variable Importance Plot. Again we see that nHBDon, ohs.sOH, and TopoPSA are the three most important variables, although their order has changed.

# Cross-validation on the train/test split
training.predict <- predict(mydata.rf, train) # predict the training set
cluster.train <- fit$cluster[train_ind] # 'fit' is the k-means object from the chemical-space clustering
Training <- cbind(train, training.predict, cluster.train) # combine all training data into one frame
write.csv(Training, file = "RFTrainingSetPredict.csv") # save file
test.predict <- predict(mydata.rf, test) # predict the test set
cluster.test <- fit$cluster[-train_ind]
Test <- cbind(test, test.predict, cluster.test) # combine all test data into one frame
write.csv(Test, file = "RFTestSetPredict.csv") # save file

Here are the files: RFTrainingSetPredict.csv, RFTestSetPredict.csv.

Our test set R^2 value is 0.887 and our test set RMSE is 7.0988. For our individual clusters:
Cluster 1 RMSE: 6.558
Cluster 2 RMSE: 9.737
Our file, with residuals and cross-validation statistics, is RFSetPredict.csv. Our chemical space with residuals is also shown below:
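These statistics can be recomputed from the saved prediction file; a minimal sketch, with the column names taken from the cbind() call above:

```r
## Helper metrics
rmse <- function(pred, obs) sqrt(mean((pred - obs)^2))
r2   <- function(pred, obs) 1 - sum((obs - pred)^2) / sum((obs - mean(obs))^2)

Test <- read.csv("RFTestSetPredict.csv", row.names = 1)
r2(Test$test.predict, Test$deltaH)    # test-set R^2
rmse(Test$test.predict, Test$deltaH)  # test-set RMSE
## RMSE within each k-means cluster
tapply(seq_len(nrow(Test)), Test$cluster.test,
       function(i) rmse(Test$test.predict[i], Test$deltaH[i]))
```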

H2O Deep Learning Model

As an alternative to the random forest model, we built an H2O neural network to try to create an even stronger model. To compare the results directly with the RF model, we used the same train and test sets.

library(h2o)
h2o.init()
mydata = read.csv(file="SolventInformationFeatureSelected.csv",header=TRUE,row.names="molID")
## 75% of the sample size
smp_size <- floor(0.75 * nrow(mydata))
## set the seed to make the partition reproducible
set.seed(123)
train_ind <- sample(seq_len(nrow(mydata)), size = smp_size)
train <- mydata[train_ind, ]
test <- mydata[-train_ind, ]
#Prepare data for H2O
trainH2O<-as.h2o(train)
testH2O<-as.h2o(test)

Because the layer width is a critical factor for accuracy, ten models were created with the layer width varied. This process was repeated twice before selection, after which ten more finely spaced layer widths were examined. It is important to note that the model was fine-tuned for the particular seed we were using.

residuals <- function(f,t){(f-t)^2} # squared residuals; helper for measuring accuracy
R2 <- function(f,t){
  tavg <- mean(t)
  SSres <- sum(residuals(f,t))
  SStot <- sum((t-tavg)^2)
  output <- 1 - SSres/SStot
  print(output)
}
for (i in 1:10){
  hNeuralNet <- h2o.deeplearning(y="deltaH",
                                 training_frame=trainH2O,
                                 model_id="hNeuralNet",
                                 hidden=c(100*i, 100*i),
                                 standardize=TRUE,
                                 seed=123,
                                 max_runtime_secs=300,
                                 variable_importances=TRUE)
  # cross validation
  predictTrain <- predict(hNeuralNet, trainH2O)
  predictTest <- predict(hNeuralNet, testH2O)
  R2(predictTrain, trainH2O$deltaH)
  R2(predictTest, testH2O$deltaH)
}
#Consistently high marks at 100
for (i in 1:10){
  hNeuralNet <- h2o.deeplearning(y="deltaH",
                                 training_frame=trainH2O,
                                 model_id="hNeuralNet",
                                 hidden=c(50+10*i, 50+10*i),
                                 standardize=TRUE,
                                 seed=123,
                                 max_runtime_secs=300,
                                 variable_importances=TRUE)
  # cross validation
  predictTrain <- predict(hNeuralNet, trainH2O)
  predictTest <- predict(hNeuralNet, testH2O)
  R2(predictTrain, trainH2O$deltaH)
  R2(predictTest, testH2O$deltaH)
}

A hidden layer width of 130 was selected. The data was standardized for the H2O network, and the seed was set to make the results reproducible. The final model is shown below:

hNeuralNet <- h2o.deeplearning(y="deltaH",
                               training_frame=trainH2O,
                               model_id="hNeuralNet",
                               standardize=TRUE,
                               seed=123,
                               hidden=c(130,130),
                               variable_importances=TRUE)
summary(hNeuralNet)
[output]
H2ORegressionMetrics: deeplearning
Reported on training data.
**Metrics reported on full training frame**
MSE: 27.54657
RMSE: 5.248483
MAE: 3.770966
RMSLE: NaN
Mean Residual Deviance : 27.54657
Variable Importances:
variable relative_importance scaled_importance percentage
1 nAtomP 1.000000 1.000000 0.013171
2 SPC.4 0.967459 0.967459 0.012743
3 khs.sssCH 0.964586 0.964586 0.012705
4 MLogP 0.961444 0.961444 0.012664
5 ATSm1 0.956220 0.956220 0.012595
---
80 nRotB 0.829468 0.829468 0.010925
81 SCH.7 0.827819 0.827819 0.010904
82 VCH.7 0.815953 0.815953 0.010747
83 C2SP2 0.801603 0.801603 0.010558
84 TopoPSA 0.798668 0.798668 0.010520
85 nAromBlocks 0.781421 0.781421 0.010292
[output]
#Saving information
predictTrain<-predict(hNeuralNet,trainH2O)
predictTest<-predict(hNeuralNet,testH2O)
resT <- residuals(predictTest, testH2O$deltaH)    # squared residuals (test)
resTr <- residuals(predictTrain, trainH2O$deltaH) # squared residuals (train)
TEST <- cbind(as.data.frame(testH2O),as.data.frame(predictTest),as.data.frame(resT))
TRAIN <- cbind(as.data.frame(trainH2O),as.data.frame(predictTrain),as.data.frame(resTr))
#cluster information will be added in manually
write.csv(TEST, file="NNTestSetPredict.csv")
write.csv(TRAIN, file="NNTrainSetPredict.csv")
h2o.saveModel(object=hNeuralNet, path=getwd(), force=TRUE)

Here are the resulting files: hNeuralNet, NNTrainSetPredict.csv, NNTestSetPredict.csv, NNSetPredict.csv.

Our test set R^2 value is 0.903, an increase of 1.8% relative to RF. Our test set RMSE is 6.601, a 7.0% reduction in error. For our individual clusters:
Cluster 1 RMSE: 5.13
Cluster 2 RMSE: 7.43
Our file, with cross-validation statistics, is NNTestSetCV.csv. Our chemical space with residuals is also shown below:

References

1. William E. Acree and Andrew SID Lang. Acree Enthalpy of Solvation Dataset. figshare (2015). https://doi.org/10.6084/m9.figshare.1572326.v1