Modeling Enthalpy of Solvation in Ethanol

Researchers: Robert G Sisk and Andrew SID Lang

Data

We extracted the experimentally measured enthalpy of solvation in ethanol values for 144 compounds from the Acree Enthaply of Solvation Dataset [1]. Ferrocene was not considered as it is a complex.

Descriptors

We used Rajarshi Guha's CDK Descriptor Calculator (v 1.4.6) to calculated the CDK descriptors for all the compounds in our data file. We selected the option to add explicit hydrogens but did not include WHIM, CPSA, or any geometrical descriptors.
[include file with descriptors here]
Once descriptors were calculated, we deleted all columns that had a zero standard deviation, including columns with all "NA."

setwd("C://..")
 
mydata = read.csv(file="SolventInformationWithDescriptors.csv",head=TRUE,row.names="Key")
 
ncol(mydata) #get number of columns
[output]
221
[output]
library(caret) ##for feature selection
 
mydata <- mydata[,colSums(is.na(mydata))<nrow(mydata)] ##remove columns of all entries "NA"
ncol(mydata)
[output]
221
[output]
write.csv(mydata,file="DescriptorsNoNA.csv")
 
nzv <-nearZeroVar(mydata) ##remove zeros and other small variance columns
mydata <- mydata[, -nzv]
ncol(mydata)
[output]
119
[output]
write.csv(mydata,file="DescriptorsNoNZV.csv")

After the above code was executed, two columns corresponding to the two descriptors HybRatio and Kier3 wre deleted due to multiple NA values. Also, the following compounds (rows) were removed due to multiple NA values:

Compounds Removed
Helium
Neon
Argon
Xenon
Radon
Hydrogen

Feature Selection

Feature selection was performed using the following R code that removed highly correlated descriptors:
mydata = read.csv(file="Solvent InformationReadyForModeling.csv",head=TRUE,row.names="Key")
descrCorr <- cor(mydata)
highCorr <- findCorrelation(descrCorr, 0.95, verbose=T)
##remove highly correlated descriptors
mydata <- mydata[, -highCorr]
ncol(mydata)
[output]
86
[output]
write.csv(mydata,file="SolventInformationFeatureSelected.csv")

Description of Data

The data that we will use to create our RF models now consists of 138 ethanol enthalpy of solvation values with 86 descriptors (MW was added back to the data for informational purposes), see the histograms below:
EnthalpyHistogram.png
Histogram of Enthalpy of Solvation Data After Curation and Feature Selection
MWHistogram.png
Histogram of Enthalpy of Solvation Data After Curation and Feature Selection by Molecular Weight

Principal Component & Cluster Analysis

Cluster Analysis
## determining the best number of clusters
library(fpc)
library(cluster)
mydata = read.csv(file="SolventInformationFeatureSelected.csv",head=TRUE,row.names="Key")
asw <- numeric(20)
for (k in 2:20)
asw[[k]] <- pam(mydata, k) $ silinfo $ avg.width
asw
[output]
[1] 0.0000000 0.4144260 0.4179739 0.3358037 0.3437506 0.3564146 0.3714603 0.3462984 0.3108418
0.3074072 0.3351801 0.3365456 0.3388145 0.3351526 0.3109977 0.3156479 0.3159808
[18] 0.3116679 0.3293435 0.3328914
[output]
k.best <- which.max(asw)
cat("silhouette-optimal number of clusters:", k.best, "\n")
#3
 
##Cluster Analysis
mydata = read.csv(file="SolventInformationFeatureSelected.csv",head=TRUE,row.names="Key")
mydata <- scale(mydata) # standardize variables
# K-Means Cluster Analysis
fit <- kmeans(mydata, 3) #3 cluster solution
# get cluster means
aggregate(mydata,by=list(fit$cluster),FUN=mean)
[output]
Group.1 DeltaHsolv      ALogP     ALogp2        AMR    BCUTw.1l    BCUTw.1h
1       1  0.2704986  0.2374628 -0.1573636 -0.2879611  0.05055558  0.04366062
2       2 -1.1648089 -1.3120591  0.7743171  0.9234572 -0.19608056 -0.31448339
3       3 -0.7562661  0.2046194  0.1499057  1.7547322 -0.20620319  0.25735471
[output]
# append cluster assignment
mydata <- data.frame(mydata, fit$cluster)
write.csv(mydata, file = "clusters.csv")

Principal Component Analysis
##PCA
mydata = read.csv(file="SolventInformationFeatureSelected.csv",head=TRUE,row.names="Key")
pc1 <- prcomp(mydata, scale. = T)
x <- pc1$x
summary(pc1)
 
[output]
Importance of components%s:
                          PC1    PC2    PC3    PC4     PC5     PC6     PC7     PC8     PC9
Standard deviation     4.9562 3.4753 2.9841 2.7209 2.34156 2.04465 1.81481 1.63781 1.54550
Proportion of Variance 0.2823 0.1388 0.1023 0.0851 0.06302 0.04805 0.03786 0.03083 0.02745
Cumulative Proportion  0.2823 0.4212 0.5235 0.6086 0.67163 0.71969 0.75754 0.78838 0.81583
[output]
 
write.csv(x, file = "PCA.csv")

The following image depicts the chemical space that will be inputted into R to create the random forest model. We used 2 principal components that explain 42% of the variation and clustered the data into three groups upon the recommendation from the pam function in R.
PCAWithClusters.png
Chemical Space - Clustered Using R

Modeling

##Modeling
library("randomForest")
 
mydata = read.csv(file="SolventInformationFeatureSelected.csv",head=TRUE,row.names="Key")
 
## 75% of the sample size
smp_size <- floor(0.75 * nrow(mydata))
 
## set the seed to make your partition reproductible
set.seed(123)
train_ind <- sample(seq_len(nrow(mydata)), size = smp_size)
 
train <- mydata[train_ind, ]
test <- mydata[-train_ind, ]
 
mydata.rf <- randomForest(DeltaHsolv ~ ., data = train,importance = TRUE)
print(mydata.rf)
[output]
Call:
 randomForest(formula = DeltaHsolv ~ ., data = train, importance = TRUE)
               Type of random forest: regression
                     Number of trees: 500
No. of variables tried at each split: 28
 
          Mean of squared residuals: 95.31999
                    % Var explained: 82.69
[output]
saveRDS(mydata.rf, file = "EnthalpyTrainingSet")
 
training.predict <- predict(mydata.rf,train)
write.csv(training.predict, file = "RFTrainingSetPredict.csv")
 
test.predict <- predict(mydata.rf,test)
write.csv(test.predict, file = "RFTestSetPredict.csv")

TestSetPredictedVsMeasured.png

Predicted vs Measured DeltaHsolv for the test-set

The test-set statistics were: AAE 4.76, MSE 39.7, R2 82.9% (We did not include azelaic acid in our calculations).

Create RF model using all data.
library("randomForest")
 
mydata = read.csv(file="SolventInformationFeatureSelected.csv",head=TRUE,row.names="Key")
 
mydata.rf <- randomForest(DeltaHsolv ~ ., data = mydata,importance = TRUE)
print(mydata.rf)
[output]
Call:
 randomForest(formula = DeltaHsolv ~ ., data = mydata, importance = TRUE)
               Type of random forest: regression
                     Number of trees: 500
No. of variables tried at each split: 28
 
          Mean of squared residuals: 65.72521
                    % Var explained: 87.47
[output]
varImpPlot(mydata.rf,main="Random Forest Variable Importance")
saveRDS(mydata.rf, file = "EnthalpyOfSolvation")
 
training.predict <- predict(mydata.rf,mydata)
write.csv(training.predict, file = "AllDataRFPredict.csv")

The most important descriptors (nHBDon, TopoPSA, khs.sOH, AMR) can be seen on the following plot:
VariableImportance.png
Variable Importance Plot
Results
We can see...
PCAColouredByAE.png
PCA Colored by AE






References

1. Wiiliam E. Acree and Andrew SID Lang. Acree Enthalpy of Solvation Dataset. figshare. (2015). https://doi.org/10.6084/m9.figshare.1572326.v1