EOS002

Enthalpy of Solvation
Researchers: Brandon J Jaquis, Andrew SID Lang, Ailin Li, Nolan D Monnier
License: CC0

Data
We extracted the experimentally measured enthalpies of solvation in ethanol for 144 compounds from the Acree Enthalpy of Solvation Dataset [1]. Ferrocene was not considered, as it is a complex.

Descriptors
We used Rajarshi Guha's CDK Descriptor Calculator (v 1.4.8) to calculate the CDK descriptors for all the compounds in our data file. We selected the option to add explicit hydrogens and exported the results as a .csv file:

Once descriptors were calculated, we deleted all columns with zero standard deviation, including columns consisting entirely of "NA", using the following code.

code
setwd("C://..")

mydata = read.csv(file="CDKDescriptors.csv",header=TRUE,row.names="molID")

ncol(mydata) #get number of columns
[output] 287 [output]

mydata <- mydata[,colSums(is.na(mydata))<nrow(mydata)] ##remove columns whose entries are all "NA"
ncol(mydata)
[output] 222 [output]

write.csv(mydata,file="CDKDescriptorsNoNA.csv")
code

Here's the resulting file:

Then we used the caret package to remove columns where every value is the same (near-zero variance).

code
library(caret) ##for feature selection
nzv <- nearZeroVar(mydata) ##identify zero- and near-zero-variance columns
mydata <- mydata[, -nzv]
ncol(mydata)
[output] 120 [output]

write.csv(mydata,file="DescriptorsNoNZV.csv")
code

Here's the resulting file:

Opening the resulting file, we see that there are still a few "NA" values. Six rows (molID 57-62: Helium, Neon, Argon, Xenon, Radon, Hydrogen) were removed due to multiple "NA" values, as were several columns with multiple "NA" values (HybRatio, Kier3, topoShape). This leaves us with a set of 138 deltaH values and 116 CDK descriptors:

A correlation matrix was built and used to eliminate columns with high covariance. Of each pair of highly correlated columns, we removed the one with the larger mean correlation with the remaining columns (the behavior of caret's findCorrelation).

code
mydata = read.csv(file="DescriptorsNoNZVCleaned.csv",head=TRUE,row.names="molID")
descrCorr <- cor(mydata)
highCorr <- findCorrelation(descrCorr, 0.95, verbose=T)
mydata <- mydata[, -highCorr]
ncol(mydata)
[output] 86 [output]

write.csv(mydata,file="SolventInformationFeatureSelected.csv")
code

Here is the resulting file:
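caret's findCorrelation applies this pairwise rule internally; as a rough base-R sketch of the same idea on made-up data (the 0.95 cutoff matches the one above; the toy columns are purely illustrative):

```r
## Toy illustration of correlation-based elimination: for each highly
## correlated pair, drop the column whose mean absolute correlation
## with everything else is larger.
set.seed(1)
x1 <- rnorm(50)
toy <- data.frame(x1 = x1,
                  x2 = x1 + rnorm(50, sd = 0.01),  # near-duplicate of x1
                  x3 = rnorm(50))                  # independent column
cm <- abs(cor(toy))
diag(cm) <- 0
drop <- character(0)
for (i in 1:(ncol(cm) - 1)) {
  for (j in (i + 1):ncol(cm)) {
    if (cm[i, j] > 0.95) {
      ## drop the member of the pair with the larger mean correlation
      worse <- if (mean(cm[i, ]) > mean(cm[j, ])) i else j
      drop <- union(drop, colnames(cm)[worse])
    }
  }
}
toy.reduced <- toy[, setdiff(colnames(toy), drop), drop = FALSE]
ncol(toy.reduced)  # 2: one of the x1/x2 pair was removed, x3 survives
```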

Chemical Space
As in the previous year, the data set contains 138 molecules with 86 descriptors, which shows a high level of consistency between the two chemical spaces. Accordingly, the histogram of molecular weights is also the same:



To better visualize the chemical space, the data set was projected using PCA. We also clustered the data to get a better understanding of the data spread.

code
library(fpc)
library(cluster)
mydata <- read.csv(file="SolventInformationFeatureSelected.csv", header=TRUE, row.names="molID")

##PCA
pc <- prcomp(mydata, scale. = T)
p <- pc$x
summary(pc)
[output]
Importance of components:
                          PC1    PC2    PC3     PC4     PC5     PC6     PC7     PC8    PC9
Standard deviation     4.8857 3.4874 2.9531 2.68900 2.36460 2.04054 1.80613 1.64269 1.5379
Proportion of Variance 0.2776 0.1414 0.1024 0.08408 0.06502 0.04842 0.03793 0.03138 0.0275
Cumulative Proportion  0.2776 0.4190 0.5204 0.60446 0.66947 0.71789 0.75582 0.78720 0.8147
[output]

##K-means clustering
mydataS <- scale(mydata)
asw <- numeric(5) #determining k
for (k in 2:5){
  asw[k] <- pam(mydataS, k)$silinfo$avg.width
}
k.best <- which.max(asw)
k.best
[output] 2 [output]
fit <- kmeans(mydataS, k.best) #clustering

##Saving results
p <- cbind(p, fit$cluster) #writing as one file
write.csv(p, file = "pca.csv") #saving
code

The following picture shows the k-means clustered data points graphed over PC1&2 and PC1&3. Unlike the past year, we scaled the data before selecting the number of clusters, resulting in one fewer cluster. The algorithm separated a main group (containing 84.1% of the data) from a secondary outlier group. Here is the resulting file:
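The "Proportion of Variance" row in the prcomp summary is simply each component's variance over the total, and "Cumulative Proportion" is its running sum; a quick base-R check on a made-up matrix (a stand-in, not our descriptor set):

```r
## "Proportion of Variance" in summary(prcomp(...)) is
## sdev^2 / sum(sdev^2); "Cumulative Proportion" is its cumsum.
set.seed(42)
toy <- matrix(rnorm(100 * 5), ncol = 5)   # stand-in for the descriptor matrix
pc <- prcomp(toy, scale. = TRUE)
prop <- pc$sdev^2 / sum(pc$sdev^2)        # proportion of variance per PC
cumsum(prop)                              # cumulative proportion, ends at 1
```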

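The silhouette-based choice of k used above can be sketched on made-up data (using cluster's pam as in the notebook; the two-blob data is purely illustrative):

```r
library(cluster)  # pam() computes average silhouette widths in $silinfo

set.seed(123)
## two well-separated toy blobs, so k = 2 should maximize the silhouette
toy <- rbind(matrix(rnorm(40, mean = 0), ncol = 2),
             matrix(rnorm(40, mean = 6), ncol = 2))
asw <- numeric(5)
for (k in 2:5) {
  asw[k] <- pam(toy, k)$silinfo$avg.width  # average silhouette width for k
}
k.best <- which.max(asw)
k.best  # 2 for two clearly separated blobs
```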
Modeling
This is the first model we created. No particular conditions were set for the randomForest function. Our training sample was 75% of the data, with a seed of 123.

code
library("randomForest")
mydata = read.csv(file="SolventInformationFeatureSelected.csv",head=TRUE,row.names="molID")
smp_size <- floor(0.75 * nrow(mydata)) ##75% of the sample size
set.seed(123) ##set the seed to make the partition reproducible
train_ind <- sample(seq_len(nrow(mydata)), size = smp_size)
train <- mydata[train_ind, ]
test <- mydata[-train_ind, ]
mydata.rf <- randomForest(deltaH ~ ., data = train, importance = TRUE)
print(mydata.rf)
[output]
Call:
 randomForest(formula = deltaH ~ ., data = train, importance = TRUE)
               Type of random forest: regression
                     Number of trees: 500
No. of variables tried at each split: 28

          Mean of squared residuals: 88.37216
                    % Var explained: 83.95
[output]

saveRDS(mydata.rf, file = "EnthalpyTraining Set") #Stores file
varImpPlot(mydata.rf, sort = T, main="Variable Importance") #Displays importance plot
code

The actual R model is contained in this file:. Here is the Variable Importance Plot. Again we see that nHBDon, ohs.sOH, and TopoPSA are the three most important variables, although their order has changed.

code
##Cross-validation
training.predict <- predict(mydata.rf,train) #Predicts train set
cluster.train <- fit$cluster[train_ind]
Training <- cbind(train, training.predict, cluster.train) #Combines all train data into one file
write.csv(Training, file = "RFTrainingSetPredict.csv") #Saves file

test.predict <- predict(mydata.rf,test) #Predicts test set
cluster.test <- fit$cluster[-train_ind]
Test <- cbind(test, test.predict, cluster.test) #Combines the test set into one file
write.csv(Test, file = "RFTestSetPredict.csv") #Saves file
code

Here are the files:. Our test set R^2 value is 0.887 and our test set RMSE is 7.0988. For our individual clusters: Cluster 1 RMSE: 6.558; Cluster 2 RMSE: 9.737. Our file, with residuals and cross-validation statistics, is. Our chemical space with residuals is also shown below:
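The quoted R^2 and RMSE come from comparing observed and predicted deltaH; a minimal base-R sketch with made-up stand-in vectors (not our actual test-set values, which live in RFTestSetPredict.csv):

```r
## How R^2 and RMSE are obtained from observed vs predicted deltaH
## (toy vectors stand in for the saved .csv columns).
obs  <- c(-10.2, -35.6, -20.1, -5.4, -28.9)  # made-up observed values
pred <- c(-11.0, -34.1, -19.5, -7.0, -27.8)  # made-up predictions
rmse <- sqrt(mean((obs - pred)^2))
r2   <- 1 - sum((obs - pred)^2) / sum((obs - mean(obs))^2)
round(c(RMSE = rmse, R2 = r2), 3)  # RMSE 1.185, R2 0.989 for these toys
```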

H2O Deep Learning Model
As an alternative to the random forest model, we built an H2O neural network to try to create an even stronger model. To compare the results directly with those of the RF model, the same train and test sets were used.

code
library(h2o)
h2o.init()

mydata = read.csv(file="SolventInformationFeatureSelected.csv",head=TRUE,row.names="molID")

smp_size <- floor(0.75 * nrow(mydata)) ##75% of the sample size
set.seed(123) ##set the seed to make the partition reproducible
train_ind <- sample(seq_len(nrow(mydata)), size = smp_size)
train <- mydata[train_ind, ]
test <- mydata[-train_ind, ]

##Prepare data for H2O
trainH2O <- as.h2o(train)
testH2O <- as.h2o(test)
code

Because layer width is a critical component of accuracy, ten models were created with the layer width varied from 100 to 1000 neurons in steps of 100. Widths around 100 consistently scored well, so a second, finer scan of ten widths from 60 to 150 in steps of 10 was run before the final selection. It is important to note that the model was fine-tuned for the seed we were using.

code
##functions for determining accuracy
residuals <- function(f,t){(f-t)^2}
R2 <- function(f,t){
  tavg <- mean(t)
  SSres <- sum(residuals(f,t))
  SStot <- sum((t-tavg)^2)
  output <- 1 - SSres/SStot
  print(output)
}

##coarse scan: widths 100, 200, ..., 1000 (consistently high marks at 100)
for (i in 1:10){
  hNeuralNet <- h2o.deeplearning(y="deltaH",
                                 training_frame=trainH2O,
                                 model_id="hNeuralNet",
                                 hidden=c((100*i),(100*i)),
                                 standardize=TRUE,
                                 seed=123,
                                 max_runtime_secs=300,
                                 variable_importances=TRUE)
  #cross validation
  predictTrain <- predict(hNeuralNet,trainH2O)
  predictTest <- predict(hNeuralNet,testH2O)
  R2(predictTrain,trainH2O$deltaH)
  R2(predictTest,testH2O$deltaH)
}

##fine scan: widths 60, 70, ..., 150
for (i in 1:10){
  hNeuralNet <- h2o.deeplearning(y="deltaH",
                                 training_frame=trainH2O,
                                 model_id="hNeuralNet",
                                 hidden=c((50+10*i),(50+10*i)),
                                 standardize=TRUE,
                                 seed=123,
                                 max_runtime_secs=300,
                                 variable_importances=TRUE)
  #cross validation
  predictTrain <- predict(hNeuralNet,trainH2O)
  predictTest <- predict(hNeuralNet,testH2O)
  R2(predictTrain,trainH2O$deltaH)
  R2(predictTest,testH2O$deltaH)
}
code

A hidden layer width of 130 was selected. The inputs were standardized, and the seed was set to make the results reproducible. The final model is shown below:

code
hNeuralNet <- h2o.deeplearning(y="deltaH",
                               training_frame=trainH2O,
                               model_id="hNeuralNet",
                               standardize=TRUE,
                               seed=123,
                               hidden=c(130,130),
                               variable_importances=TRUE)
summary(hNeuralNet)
[output]
H2ORegressionMetrics: deeplearning
Reported on training data.
**Metrics reported on full training frame**

MSE:  27.54657
RMSE:  5.248483
MAE:  3.770966
RMSLE:  NaN
Mean Residual Deviance :  27.54657

Variable Importances:
      variable relative_importance scaled_importance percentage
1       nAtomP            1.000000          1.000000   0.013171
2        SPC.4            0.967459          0.967459   0.012743
3    khs.sssCH            0.964586          0.964586   0.012705
4        MLogP            0.961444          0.961444   0.012664
5        ATSm1            0.956220          0.956220   0.012595
---
80       nRotB            0.829468          0.829468   0.010925
81       SCH.7            0.827819          0.827819   0.010904
82       VCH.7            0.815953          0.815953   0.010747
83       C2SP2            0.801603          0.801603   0.010558
84     TopoPSA            0.798668          0.798668   0.010520
85 nAromBlocks            0.781421          0.781421   0.010292
[output]

predictTrain <- predict(hNeuralNet,trainH2O)
predictTest <- predict(hNeuralNet,testH2O)

##Saving information
resT <- residuals(predictTest,testH2O$deltaH)
resTr <- residuals(predictTrain,trainH2O$deltaH)

TEST <- cbind(as.data.frame(testH2O),as.data.frame(predictTest),as.data.frame(resT))
TRAIN <- cbind(as.data.frame(trainH2O),as.data.frame(predictTrain),as.data.frame(resTr))
#cluster information will be added in manually
write.csv(TEST, file="NNTestSetPredict.csv")
write.csv(TRAIN, file="NNTrainSetPredict.csv")
h2o.saveModel(object=hNeuralNet, path=getwd(), force=TRUE)
code

Here are the resulting files:. Our test set R^2 value is 0.903, a 1.8% relative increase over RF, and our test set RMSE is 6.601, a 7.0% reduction in error. For our individual clusters: Cluster 1 RMSE: 5.13; Cluster 2 RMSE: 7.43. Our file, with cross-validation statistics, is. Our chemical space with residuals is also shown below:
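The quoted percentage comparisons follow directly from the two models' test statistics; a quick arithmetic check:

```r
## Sanity check of the quoted model-comparison percentages.
rf.rmse <- 7.0988; nn.rmse <- 6.601   # RF vs NN test-set RMSE
rf.r2   <- 0.887;  nn.r2   <- 0.903   # RF vs NN test-set R^2
round(100 * (rf.rmse - nn.rmse) / rf.rmse, 1)  # 7.0% reduction in error
round(100 * (nn.r2 - rf.r2) / rf.r2, 1)        # 1.8% relative increase in R^2
```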