CDK001

Modeling the Solubility of Organic Compounds in 1-Octanol
Researchers: Michael A Buonaiuto and Andrew SID Lang

Procedure
We extracted all octanol solubility data from the open notebook science challenge solubility database.

After removing all items marked “DONOTUSE” and obvious outliers, we looked for entries with multiple values.

For solutes with multiple solubility values, we kept only the ones listed in the Abraham paper. [1] If there was no Abraham paper value, we kept the RAEVSKY OA value. In the rare case that there were two Abraham paper (or RAEVSKY OA [2]) values for a single chemspider ID, we kept the higher of the two. This resulted in a file of 261 data points. .

Descriptors
We used Rajarshi Guha's CDK Descriptor Calculator (v 1.4.6) to calculated the CDK descriptors for all the compounds in our refined data file. We selected the option to add explicit hydrogens. .Once descriptors were calculated, we deleted all columns that had a zero standard deviation, including columns with all "NA."

code mydata = read.csv(file="20150728InputForFeatureSelection.csv",head=TRUE,row.names="csid")

ncol(mydata) #get number of columns [output] 287 [output] library(caret) ##for feature selection

mydata <- mydata[,colSums(is.na(mydata))<nrow(mydata)] ##remove columns of all entries "NA" ncol(mydata) [output] 222 [output] write.csv(mydata,file="DescriptorsNoNA.csv")

nzv <-nearZeroVar(mydata) ##remove zeros and other small variance columns mydata <- mydata[, -nzv] ncol(mydata) [output] 143 [output] write.csv(mydata,file="DescriptorsNoNZV.csv") code

After the above code was executed, two rows that had multiple NA entries were also deleted in order to remove highly correlated descriptors as part of the feature selection process.

Feature Selection
Feature selection was performed using the following R code that removed highly correlated descriptors:

code mydata = read.csv(file="DescriptorsNoNZV.csv",head=TRUE,row.names="csid") descrCorr <- cor(mydata); highCorr <- findCorrelation(descrCorr, 0.90, verbose=T); mydata <- mydata[, -highCorr]; ncol(mydata) [output] 87 [output] write.csv(mydata,file="DescriptorsFeatureSelected.csv") code
 * 1) remove highly correlated descriptors

Description of Data
The data that we will use to create our RF models now consists of 259 1-octanol solubility values with 86 descriptors, see the histogram below:

Histogram of Solubility Data After Curation and Feature Selection Histogram of Solubility Data After Curation and Feature Selection by Molecular Weight

Principal Component & Cluster Analysis
code library(fpc) library(cluster) mydata = read.csv(file="DescriptorsFeatureSelected.csv",head=TRUE,row.names="csid") asw <- numeric(20) for (k in 2:20) aswk <- pam(mydata, k) $ silinfo $ avg.width asw [output] [1] 0.0000000 0.7352926 0.4876591 0.3789633 0.4052174 0.4129726 0.4262371 0.4391316 0.4286310 0.3785256 0.3851924 [12] 0.3942103 0.4035164 0.4044112 0.3437240 0.3447198 0.3432463 0.3398725 0.3454049 0.3415452 [output] k.best <- which.max(asw) cat("silhouette-optimal number of clusters:", k.best, "\n")
 * Cluster Analysis**
 * 1) determining the best number of clusters
 * 2

mydata = read.csv(file="DescriptorsFeatureSelected.csv",head=TRUE,row.names="csid") mydata <- scale(mydata) # standardize variables fit <- kmeans(mydata, 2) # 2 cluster solution aggregate(mydata,by=list(fit$cluster),FUN=mean) [output] Too Much To Display [output] mydata <- data.frame(mydata, fit$cluster) write.csv(mydata, file = "clusters.csv") code
 * 1) Cluster Analysis
 * 1) K-Means Cluster Analysis
 * 1) get cluster means
 * 1) append cluster assignment

code mydata = read.csv(file="DescriptorsFeatureSelected.csv",head=TRUE,row.names="csid") pc1 <- prcomp(mydata, scale. = T) x <- pc1$x summary(pc1)
 * Principal Component Analysis**
 * 1) PCA

[output] Importance of components: PC1   PC2    PC3    PC4     PC5     PC6     PC7     PC8     PC9    PC10    PC11    PC12 Standard deviation    4.0922 3.8658 3.2466 2.5916 2.32777 1.99701 1.72687 1.50001 1.43944 1.43011 1.33050 1.29421 Proportion of Variance 0.1925 0.1718 0.1212 0.0772 0.06228 0.04584 0.03428 0.02586 0.02382 0.02351 0.02035 0.01925 Cumulative Proportion 0.1925 0.3643 0.4854 0.5626 0.62490 0.67074 0.70501 0.73088 0.75469 0.77820 0.79855 0.81780 [output]

write.csv(x, file = "PCA.csv") code

The following image depicts the chemical space that will be inputted into R to create the random forest model. We used 2 principal components that explain 36% of the variation and clustered the data into two group upon the recommendation from the //pam// function in R. Using DMax Chemistry assistant, we see the two groups are separated by nHBAcc and TopoPSA with the red cluster corresponding to compounds with zero hydrogen bond acceptors and with a topological polar surface area less than 26.48 angstroms. The red cluster is typified by compounds without hydrogen bond acceptors and with ALogP > 1.56 and with TopoPSA < 26.48; 128 out of 157 compounds match this criteria. The blue cluster is more chemically diverse than the red cluster but even so 75 of the 102 compounds have ALogP < 1.56 and TopoPSA > 26.48 and at least one hydrogen bond acceptor.

Overall the dataset consists of small (typically 100 < MW < 450) druglike compounds (50 compounds have one Lipinski Failure) that are solid at room temperature with 1-octanol solubilities typically between 0.01 M and 1M. Looking at the nAcid descriptor, we see that 49 compounds are carboxylic acids.



Chemical Space - Clustered Using R

Modeling
code library("randomForest") setwd("C:/Users/alang/Dropbox/research/OctanolSolubility/AddressingReviewersComments") mydata = read.csv(file="DescriptorsFeatureSelected.csv",head=TRUE,row.names="csid")
 * 1) Modeling

smp_size <- floor(0.75 * nrow(mydata))
 * 1) 75% of the sample size

set.seed(123) train_ind <- sample(seq_len(nrow(mydata)), size = smp_size)
 * 1) set the seed to make your partition reproductible

train <- mydata[train_ind, ] test <- mydata[-train_ind, ]

mydata.rf <- randomForest(LogS ~ ., data = train,importance = TRUE) print(mydata.rf) [output] Call: randomForest(formula = LogS ~ ., data = train, importance = TRUE) Type of random forest: regression Number of trees: 500 No. of variables tried at each split: 28

Mean of squared residuals: 0.3800402 % Var explained: 62.66 [output] saveRDS(mydata.rf, file = "OctanolTrainingSet")

training.predict <- predict(mydata.rf,train) write.csv(training.predict, file = "RFTrainingSetPredict.csv")

test.predict <- predict(mydata.rf,test) write.csv(test.predict, file = "RFTestSetPredict.csv") code



Predicted vs Measured 1-octanol solubility for the test-set

The test-set statistics were: AAE 0.53, MSE 0.44, R2 0.54

Create RF model using all data. code library("randomForest") setwd("C:/Users/alang/Dropbox/research/OctanolSolubility/AddressingReviewersComments") mydata = read.csv(file="DescriptorsFeatureSelected.csv",head=TRUE,row.names="csid") mydata.rf <- randomForest(LogS ~ ., data = mydata,importance = TRUE) print(mydata.rf) [output] Call: randomForest(formula = LogS ~ ., data = mydata, importance = TRUE) Type of random forest: regression Number of trees: 500 No. of variables tried at each split: 28

Mean of squared residuals: 0.3435526 % Var explained: 65.7 [output] varImpPlot(mydata.rf,main="Random Forest Variable Importance") saveRDS(mydata.rf, file = "Octanol1")

training.predict <- predict(mydata.rf,mydata) write.csv(training.predict, file = "FullRFTrainingSetPredict.csv") code

The most important descriptors (ALogP, XLogP, TopoPSA, nAtomP, MDEC.23, khs,aaCH, nHBAcc) can be seen on the following plot: Variable Importance Plot Note: The correlations between the ALogP and XLogP descriptors = 0.83. If removing all descriptors with correlations >=0.83 was done, then only 70 descriptors would remain instead of 87.

**Results** PCA Colored by AE The following results are an improvement of the model by Abraham and Acree who report a Training-Set R2 of 0.83

RF Out of Bag (Training Set) Mean Square Error: 0.38 R2: 0.63

Test-Set Mean Square Error: 0.44 R2: 0.54

Full RF Out of Bag Mean Square Error: 0.34 R2: 0.66

Full RF Training-Set Mean Square Error: 0.06 R2: 0.94

Performance Issues
Using DMax to find where the model does well and where is does not do well. The only significant factor is LogS itself. This may be due to the quality of data for low solubilities. When the MSE and R2 values are calculated for compunds both above and below 0.01 M, we see that for compounds with LogS <= -0.2 the MSE is 0.13 and for compounds with LogS > -0.2 the MSE is 0.05. By PC-distance - no relationship, unlike what we saw previously for Abraham solvent coefficients: http://journal.chemistrycentral.com/content/9/1/12