Enthalpy of Solvation

Researchers: Brandon J Jaquis, Andrew SID Lang, Ailin Li, Nolan D Monnier
License: CC0

Data

We extracted the experimentally measured enthalpy of solvation in ethanol values for 144 compounds from the Acree Enthalpy of Solvation Dataset [1]. Ferrocene was not considered as it is a complex.
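The filtering step can be sketched in R as follows; note that the column names (Solvent, Solute, deltaH) and the local file name are hypothetical stand-ins, not the actual layout of the Acree dataset, so this is only an illustration.
## Illustrative sketch only -- file and column names below are assumptions,
## not the actual headers of the Acree dataset.
acree <- read.csv(file="AcreeEnthalpyOfSolvation.csv", header=TRUE)
ethanol <- subset(acree, Solvent == "ethanol")     ## keep measurements in ethanol
ethanol <- subset(ethanol, Solute != "ferrocene")  ## drop the complex
nrow(ethanol)                                      ## expected: 144 compounds
write.csv(ethanol, file="EthanolDeltaHsolv.csv", row.names=FALSE)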

Descriptors

We used Rajarshi Guha's CDK Descriptor Calculator (v 1.4.8) to calculate the CDK descriptors for all the compounds in our data file. We selected the option to add explicit hydrogens and exported the results as a .csv file: CDKDescriptors.csv.
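For reference, a comparable descriptor calculation could in principle be scripted with the rcdk package; the following is only a rough sketch of that alternative route (the input file of SMILES strings is hypothetical), not the procedure we actually used.
library(rcdk) ## R bindings to the CDK
smiles <- as.character(read.csv("compounds.csv")$SMILES) ## hypothetical SMILES input file
mols <- parse.smiles(smiles)
for (m in mols) convert.implicit.to.explicit(m) ## add explicit hydrogens
descNames <- get.desc.names(type="all")
descriptors <- eval.desc(mols, descNames)
write.csv(descriptors, file="CDKDescriptors.csv")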

Once descriptors were calculated, we deleted all columns that had a zero standard deviation, including columns consisting entirely of "NA" values, using the following code:
setwd("C://..")
 
mydata = read.csv(file="CDKDescriptors.csv",header=TRUE,row.names="molID")
 
ncol(mydata) #get number of columns
[output]
287
[output]
 
mydata <- mydata[,colSums(is.na(mydata))<nrow(mydata)] ##remove columns of all entries "NA"
ncol(mydata)
[output]
222
[output]
write.csv(mydata,file="CDKDescriptorsNoNA.csv")
Here's the resulting file: CDKDescriptorsNoNA.csv. We then used the caret package's nearZeroVar function to remove zero-variance and near-zero-variance columns (columns where all, or nearly all, values are the same).
library(caret) ##for feature selection
nzv <-nearZeroVar(mydata)  ##remove zeros and other small variance columns
mydata <- mydata[, -nzv]
ncol(mydata)
[output]
120
[output]
write.csv(mydata,file="DescriptorsNoNZV.csv")
Here's the resulting file: DescriptorsNoNZV.csv. Opening the resulting file, we see that there are still a few "NA" values. Rows with molID 57-62 (Helium, Neon, Argon, Xenon, Radon, Hydrogen) were removed because they contained multiple NA values, and the columns that still had multiple "NA" values (HybRatio, Kier3, topoShape) were also removed. This leaves us with a set of 138 deltaH values and 116 CDK descriptors: DescriptorsNoNZVCleaned.csv.
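This cleaning was done by inspection, but the same removals can be expressed in R roughly as follows (a sketch against the DescriptorsNoNZV.csv file produced above):
mydata = read.csv(file="DescriptorsNoNZV.csv",header=TRUE,row.names="molID")
## drop the noble gases and hydrogen (molID 57-62), which have multiple NA descriptors
mydata <- mydata[!(rownames(mydata) %in% as.character(57:62)), ]
## drop the descriptor columns that still contain NA values (HybRatio, Kier3, topoShape)
mydata <- mydata[, colSums(is.na(mydata)) == 0]
dim(mydata) ## expect 138 rows and 117 columns (deltaH plus 116 descriptors)
write.csv(mydata,file="DescriptorsNoNZVCleaned.csv")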

Feature Selection
A correlation matrix was built and used to eliminate highly correlated columns. For each pair of highly correlated columns, we removed the one that was least unique, i.e. the one with the larger mean absolute correlation with the remaining descriptors.
mydata = read.csv(file="DescriptorsNoNZVCleaned.csv",head=TRUE,row.names="molID")
descrCorr <- cor(mydata)
highCorr <- findCorrelation(descrCorr, 0.95, verbose=T)
##remove highly correlated descriptors
mydata <- mydata[, -highCorr]
ncol(mydata)
[output]
86
[output]
write.csv(mydata,file="SolventInformationFeatureSelected.csv")
Here is the resulting file: SolventInformationFeatureSelected.csv.

Chemical Space


As in the previous year, the data set contains 138 molecules with 86 descriptors, which shows an important level of consistency between the chemical spaces. The histogram of molecular weights is therefore also the same:

Screen Shot 2018-01-30 at 3.32.03 PM.png
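A minimal sketch of how this histogram can be generated, assuming the cleaned descriptor file still contains the CDK molecular-weight column (MW):
mydata <- read.csv(file="DescriptorsNoNZVCleaned.csv", header=TRUE, row.names="molID")
## MW is the CDK molecular weight descriptor (assumed present in the file)
hist(mydata$MW, breaks=20, main="Molecular Weight Distribution", xlab="Molecular Weight (g/mol)")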

To better visualize the chemical space, we projected the data set using PCA. We also clustered the data to get a better understanding of its spread.
library(fpc)
library(cluster)
mydata <-read.csv(file="SolventInformationFeatureSelected.csv", header=TRUE, row.names="molID")
 
#PCA
pc <- prcomp(mydata, scale. = T)
p <- pc$x
summary(pc)
[output]
Importance of components%s:
                         PC1    PC2    PC3    PC4     PC5     PC6     PC7     PC8     PC9
Standard deviation     4.8857 3.4874 2.9531 2.68900 2.36460 2.04054 1.80613 1.64269 1.5379
Proportion of Variance 0.2776 0.1414 0.1024 0.08408 0.06502 0.04842 0.03793 0.03138 0.0275
Cumulative Proportion  0.2776 0.4190 0.5204 0.60446 0.66947 0.71789 0.75582 0.78720 0.8147
[output]
 
#K-means clustering
mydataS <- scale(mydata)
asw <- numeric(5) #determining K
for (k in 2:5){
    asw[k] <- pam(mydataS, k)$silinfo$avg.width
}
k.best <- which.max(asw)
k.best
[output]
   2
[output]
fit <- kmeans(mydataS, k.best) #clustering
 
#Saving Results
p<-cbind(p,fit$cluster) #Writing as one file
write.csv(p, file = "pca.csv") #Saving
The following picture shows the k-means clustered data points plotted over PC1 & PC2 and over PC1 & PC3. Unlike last year, we scaled the data before selecting the number of clusters, resulting in one fewer cluster. The algorithm separated the data into a main group (containing 84.1% of the data) and a secondary outlier group.
clus.png
Here is the resulting file: pca.csv
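The plot itself was produced separately, but a minimal sketch of how the clustered scores can be drawn over PC1 & PC2 and PC1 & PC3 from the objects created above:
## sketch: plot the k-means clusters over the first three principal components
par(mfrow=c(1,2))
plot(p[,1], p[,2], col=fit$cluster, pch=19, xlab="PC1", ylab="PC2", main="PC1 vs PC2")
plot(p[,1], p[,3], col=fit$cluster, pch=19, xlab="PC1", ylab="PC3", main="PC1 vs PC3")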

Modeling

This is the first model we created. No particular conditions were set for the randomForest function; we used a 75% training split with a seed of 123.
library("randomForest")
mydata = read.csv(file="SolventInformationFeatureSelected.csv",head=TRUE,row.names="molID")
## 75% of the sample size
smp_size <- floor(0.75 * nrow(mydata))
## set the seed to make your partition reproducible
set.seed(123)
train_ind <- sample(seq_len(nrow(mydata)), size = smp_size)
train <- mydata[train_ind, ]
test <- mydata[-train_ind, ]
mydata.rf <- randomForest(deltaH ~ ., data = train,importance = TRUE)
print(mydata.rf)
[output]
Call:
 randomForest(formula = deltaH ~ ., data = train, importance = TRUE)
               Type of random forest: regression
                     Number of trees: 500
No. of variables tried at each split: 28
 
          Mean of squared residuals: 88.37216
                    % Var explained: 83.95
[output]
saveRDS(mydata.rf, file = "EnthalpyTraining Set") #Stores File
varImpPlot(mydata.rf, sort = T, main="Variable Importance")#Displays Importance Plot
The actual R model is contained in this file: EnthalpyTraining Set. Here is the Variable Importance Plot. Again we see that nHBDon, khs.sOH, and TopoPSA are the three most important variables, although their order has changed.
Screen Shot 2018-02-21 at 1.08.38 PM.png
#Crossvalidation
training.predict <- predict(mydata.rf,train) #Predicts train set
    cluster.train <- fit$cluster[train_ind]
Training <- cbind(train, training.predict, cluster.train) #Combines all train data into one file
write.csv(Training, file = "RFTrainingSetPredict.csv") #Saves file
 
test.predict <- predict(mydata.rf,test) #Predicts test set
    cluster.test <- fit$cluster[-train_ind]
Test <- cbind(test, test.predict, cluster.test) #Combines all test set into one file
write.csv(Test, file = "RFTestSetPredict.csv") #Saves file
Here are the files: RFTrainingSetPredict.csv, RFTestSetPredict.csv.
train.png
test.png
Our test set R^2 value is 0.887, and our test set RMSE is 7.0988. For our individual clusters:
Cluster 1: RMSE: 6.558
Cluster 2: RMSE: 9.737
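A sketch of how such summary statistics can be computed from the test-set predictions and cluster assignments generated above:
## sketch: test set R^2, RMSE, and per-cluster RMSE from the objects above
res <- test$deltaH - test.predict
sqrt(mean(res^2))                                        ## overall test RMSE
1 - sum(res^2)/sum((test$deltaH - mean(test$deltaH))^2)  ## test R^2
tapply(res, cluster.test, function(x) sqrt(mean(x^2)))   ## RMSE per cluster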
Our file, with residuals and cross-validation statistics, is RFSetPredict.csv. Our chemical space with residuals is also shown below:
Screen Shot 2018-03-08 at 8.35.11 PM.png

H2O Deep Learning Model


As an alternative to the random forest model, we built an H2O Neural Network to try to create an even stronger model. To compare the results directly with the RF model, the same train and test sets were used.
library(h2o)
h2o.init()
 
mydata = read.csv(file="SolventInformationFeatureSelected.csv",head=TRUE,row.names="molID")
 
## 75% of the sample size
smp_size <- floor(0.75 * nrow(mydata))
## set the seed to make your partition reproducible
set.seed(123)
train_ind <- sample(seq_len(nrow(mydata)), size = smp_size)
train <- mydata[train_ind, ]
test <- mydata[-train_ind, ]
#Prepare data for H2O
trainH2O<-as.h2o(train)
testH2O<-as.h2o(test)

Because the hidden-layer width is a critical component for accuracy, ten models were created with varying layer widths. This process was repeated twice before selection, after which ten more finely spaced layer widths were examined. It is important to note that the model was fine-tuned for the seed we were using.
residuals <- function(f,t){(f-t)^2} #squared residuals, used by the accuracy metrics below
R2 <- function(f,t){ #coefficient of determination
     tavg <- mean(t)
     SSres <- sum(residuals(f,t))
     SStot <- sum((t-tavg)^2)
     output <- (1 - SSres/SStot)
     print(output)
}
for (i in 1:10){
     hNeuralNet<-h2o.deeplearning(y="deltaH",
                                 training_frame=trainH2O,
                                  model_id="hNeuralNet",
                                  hidden=c((100*i),(100*i)),
                                  standardize=TRUE,
                                  seed=123,
                                  max_runtime_secs=300,
                                  variable_importances=TRUE)
 
     #cross validation
     predictTrain<-predict(hNeuralNet,trainH2O)
     predictTest<-predict(hNeuralNet,testH2O)
     R2(predictTrain,trainH2O$deltaH)
     R2(predictTest,testH2O$deltaH)
}
#Consistently high marks at 100
for (i in 1:10){
     hNeuralNet<-h2o.deeplearning(y="deltaH",
                                  training_frame=trainH2O,
                                  model_id="hNeuralNet",
                                  hidden=c((50+10*i),(50+10*i)),
                                  standardize=TRUE,
                                  seed=123,
                                  max_runtime_secs=300,
                                  variable_importances=TRUE)
 
     #cross validation
     predictTrain<-predict(hNeuralNet,trainH2O)
     predictTest<-predict(hNeuralNet,testH2O)
     R2(predictTrain,trainH2O$deltaH)
     R2(predictTest,testH2O$deltaH)
}

A hidden layer width of 130 was selected. The data was standardized (standardize=TRUE), and the seed was set to make the results reproducible. The final model is shown below:
hNeuralNet<-h2o.deeplearning(y="deltaH",
                             training_frame=trainH2O,
                             model_id="hNeuralNet",
                             standardize=TRUE,
                             seed=123,
                             hidden=c(130,130),
                             variable_importances=TRUE)
summary(hNeuralNet)
[output]
H2ORegressionMetrics: deeplearning
 Reported on training data.
 **Metrics reported on full training frame**
 
MSE: 27.54657
RMSE: 5.248483
MAE: 3.770966
RMSLE: NaN
Mean Residual Deviance : 27.54657
 
Variable Importances:
 variable relative_importance scaled_importance percentage
1 nAtomP 1.000000 1.000000 0.013171
2 SPC.4 0.967459 0.967459 0.012743
3 khs.sssCH 0.964586 0.964586 0.012705
4 MLogP 0.961444 0.961444 0.012664
5 ATSm1 0.956220 0.956220 0.012595
---
80 nRotB 0.829468 0.829468 0.010925
81 SCH.7 0.827819 0.827819 0.010904
82 VCH.7 0.815953 0.815953 0.010747
83 C2SP2 0.801603 0.801603 0.010558
84 TopoPSA 0.798668 0.798668 0.010520
85 nAromBlocks 0.781421 0.781421 0.010292
[output]
 
#Saving information
predictTrain<-predict(hNeuralNet,trainH2O)
predictTest<-predict(hNeuralNet,testH2O)
 
resT <- residuals(predictTest,testH2O$deltaH)
resTr <- residuals(predictTrain,trainH2O$deltaH)
 
TEST <- cbind(as.data.frame(testH2O),as.data.frame(predictTest),as.data.frame(resT))
TRAIN <- cbind(as.data.frame(trainH2O),as.data.frame(predictTrain),as.data.frame(resTr))
     #cluster information will be added in manually
write.csv(TEST, file="NNTestSetPredict.csv")
write.csv(TRAIN, file="NNTrainSetPredict.csv")
h2o.saveModel(object=hNeuralNet, path=getwd(), force=TRUE)
Here are the resulting files: hNeuralNet, NNTrainSetPredict.csv, NNTestSetPredict.csv, NNSetPredict.csv
Screen Shot 2018-04-02 at 1.29.26 PM.png
Screen Shot 2018-03-27 at 4.03.23 PM.png
Our test set R^2 value is 0.903, an increase of 1.8% relative to RF. Our test set RMSE is 6.601, a 7.0% reduction in error. For our individual clusters:
Cluster 1: RMSE: 5.13
Cluster 2: RMSE: 7.43
Our file, with cross-validation statistics, is NNTestSetCV.csv. Our chemical space with residuals is also shown below:
Screen Shot 2018-03-27 at 4.11.34 PM.png

References

1. William E. Acree and Andrew SID Lang. Acree Enthalpy of Solvation Dataset. figshare (2015). https://doi.org/10.6084/m9.figshare.1572326.v1