EOS001

Modeling Enthalpy of Solvation in Ethanol
Researchers: Robert G Sisk and Andrew SID Lang

Data
We extracted the experimentally measured enthalpy of solvation in ethanol values for 144 compounds from the Acree Enthaply of Solvation Dataset [1]. Ferrocene was not considered as it is a complex.

Descriptors
We used Rajarshi Guha's CDK Descriptor Calculator (v 1.4.6) to calculated the CDK descriptors for all the compounds in our data file. We selected the option to add explicit hydrogens but did not include WHIM, CPSA, or any geometrical descriptors. [include file with descriptors here] Once descriptors were calculated, we deleted all columns that had a zero standard deviation, including columns with all "NA."

code setwd("C://..")

mydata = read.csv(file="SolventInformationWithDescriptors.csv",head=TRUE,row.names="Key")

ncol(mydata) #get number of columns [output] 221 [output] library(caret) ##for feature selection

mydata <- mydata[,colSums(is.na(mydata))<nrow(mydata)] ##remove columns of all entries "NA" ncol(mydata) [output] 221 [output] write.csv(mydata,file="DescriptorsNoNA.csv")

nzv <-nearZeroVar(mydata) ##remove zeros and other small variance columns mydata <- mydata[, -nzv] ncol(mydata) [output] 119 [output] write.csv(mydata,file="DescriptorsNoNZV.csv") code

After the above code was executed, two columns corresponding to the two descriptors HybRatio and Kier3 wre deleted due to multiple NA values. Also, the following compounds (rows) were removed due to multiple NA values:

Helium Neon Argon Xenon Radon Hydrogen
 * Compounds Removed**

Feature Selection
Feature selection was performed using the following R code that removed highly correlated descriptors: code mydata = read.csv(file="Solvent InformationReadyForModeling.csv",head=TRUE,row.names="Key") descrCorr <- cor(mydata) highCorr <- findCorrelation(descrCorr, 0.95, verbose=T) mydata <- mydata[, -highCorr] ncol(mydata) [output] 86 [output] write.csv(mydata,file="SolventInformationFeatureSelected.csv") code
 * 1) remove highly correlated descriptors

Description of Data
The data that we will use to create our RF models now consists of 138 ethanol enthalpy of solvation values with 86 descriptors (MW was added back to the data for informational purposes), see the histograms below: Histogram of Enthalpy of Solvation Data After Curation and Feature Selection Histogram of Enthalpy of Solvation Data After Curation and Feature Selection by Molecular Weight

Principal Component & Cluster Analysis
code library(fpc) library(cluster) mydata = read.csv(file="SolventInformationFeatureSelected.csv",head=TRUE,row.names="Key") asw <- numeric(20) for (k in 2:20) aswk <- pam(mydata, k) $ silinfo $ avg.width asw [output] [1] 0.0000000 0.4144260 0.4179739 0.3358037 0.3437506 0.3564146 0.3714603 0.3462984 0.3108418 0.3074072 0.3351801 0.3365456 0.3388145 0.3351526 0.3109977 0.3156479 0.3159808 [18] 0.3116679 0.3293435 0.3328914 [output] k.best <- which.max(asw) cat("silhouette-optimal number of clusters:", k.best, "\n")
 * Cluster Analysis**
 * 1) determining the best number of clusters
 * 3

mydata = read.csv(file="SolventInformationFeatureSelected.csv",head=TRUE,row.names="Key") mydata <- scale(mydata) # standardize variables fit <- kmeans(mydata, 3) #3 cluster solution aggregate(mydata,by=list(fit$cluster),FUN=mean) [output] Group.1 DeltaHsolv     ALogP     ALogp2        AMR    BCUTw.1l    BCUTw.1h 1      1  0.2704986  0.2374628 -0.1573636 -0.2879611  0.05055558  0.04366062 2       2 -1.1648089 -1.3120591  0.7743171  0.9234572 -0.19608056 -0.31448339 3       3 -0.7562661  0.2046194  0.1499057  1.7547322 -0.20620319  0.25735471 [output] mydata <- data.frame(mydata, fit$cluster) write.csv(mydata, file = "clusters.csv") code
 * 1) Cluster Analysis
 * 1) K-Means Cluster Analysis
 * 1) get cluster means
 * 1) append cluster assignment

code mydata = read.csv(file="SolventInformationFeatureSelected.csv",head=TRUE,row.names="Key") pc1 <- prcomp(mydata, scale. = T) x <- pc1$x summary(pc1)
 * Principal Component Analysis**
 * 1) PCA

[output] Importance of components%s: PC1   PC2    PC3    PC4     PC5     PC6     PC7     PC8     PC9 Standard deviation    4.9562 3.4753 2.9841 2.7209 2.34156 2.04465 1.81481 1.63781 1.54550 Proportion of Variance 0.2823 0.1388 0.1023 0.0851 0.06302 0.04805 0.03786 0.03083 0.02745 Cumulative Proportion 0.2823 0.4212 0.5235 0.6086 0.67163 0.71969 0.75754 0.78838 0.81583 [output]

write.csv(x, file = "PCA.csv") code

The following image depicts the chemical space that will be inputted into R to create the random forest model. We used 2 principal components that explain 42% of the variation and clustered the data into three groups upon the recommendation from the pam function in R. //Chemical Space - Clustered Using R//

//Modeling//
code library("randomForest")
 * 1) Modeling

mydata = read.csv(file="SolventInformationFeatureSelected.csv",head=TRUE,row.names="Key")

smp_size <- floor(0.75 * nrow(mydata))
 * 1) 75% of the sample size

set.seed(123) train_ind <- sample(seq_len(nrow(mydata)), size = smp_size)
 * 1) set the seed to make your partition reproductible

train <- mydata[train_ind, ] test <- mydata[-train_ind, ]

mydata.rf <- randomForest(DeltaHsolv ~ ., data = train,importance = TRUE) print(mydata.rf) [output] Call: randomForest(formula = DeltaHsolv ~ ., data = train, importance = TRUE) Type of random forest: regression Number of trees: 500 No. of variables tried at each split: 28

Mean of squared residuals: 95.31999 % Var explained: 82.69 [output] saveRDS(mydata.rf, file = "EnthalpyTrainingSet")

training.predict <- predict(mydata.rf,train) write.csv(training.predict, file = "RFTrainingSetPredict.csv")

test.predict <- predict(mydata.rf,test) write.csv(test.predict, file = "RFTestSetPredict.csv") code

Predicted vs Measured DeltaHsolv for the test-set

The test-set statistics were: AAE 4.76, MSE 39.7, R2 82.9% (We did not include azelaic acid in our calculations).

//Create RF model using all data.// code library("randomForest")

mydata = read.csv(file="SolventInformationFeatureSelected.csv",head=TRUE,row.names="Key")

mydata.rf <- randomForest(DeltaHsolv ~ ., data = mydata,importance = TRUE) print(mydata.rf) [output] Call: randomForest(formula = DeltaHsolv ~ ., data = mydata, importance = TRUE) Type of random forest: regression Number of trees: 500 No. of variables tried at each split: 28

Mean of squared residuals: 65.72521 % Var explained: 87.47 [output] varImpPlot(mydata.rf,main="Random Forest Variable Importance") saveRDS(mydata.rf, file = "EnthalpyOfSolvation")

training.predict <- predict(mydata.rf,mydata) write.csv(training.predict, file = "AllDataRFPredict.csv") code

The most important descriptors (nHBDon, TopoPSA, khs.sOH, AMR) can be seen on the following plot: //Variable Importance Plot// **Results** We can see... //PCA Colored by AE//