Modeling the Solubility of Organic Compounds in 1-Octanol

Researchers: Michael A. Buonaiuto and Andrew S.I.D. Lang

Procedure

We extracted all 1-octanol solubility data from the Open Notebook Science Challenge Solubility Database: Octanol Data From ONS.csv

After removing all items marked “DONOTUSE” and obvious outliers, we looked for entries with multiple values.

For solutes with multiple solubility values, we kept only the values listed in the Abraham paper [1]. If there was no Abraham paper value, we kept the Raevsky value [2]. In the rare case that there were two Abraham (or Raevsky) values for a single ChemSpider ID, we kept the higher of the two. This resulted in a file of 261 data points: Refined Octanol Data From ONS.csv.
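
A minimal sketch of this selection logic in R, using hypothetical column names (csid, source, solubility); the actual headers in Octanol Data From ONS.csv may differ:

raw <- read.csv("Octanol Data From ONS.csv", header=TRUE)
## hypothetical source labels: prefer the Abraham paper value, then Raevsky
raw$priority <- ifelse(raw$source == "Abraham", 1,
                ifelse(raw$source == "Raevsky", 2, 3))
## within each ChemSpider ID, keep the preferred source and,
## on ties, the higher of the two solubility values
raw <- raw[order(raw$csid, raw$priority, -raw$solubility), ]
refined <- raw[!duplicated(raw$csid), ]
write.csv(refined, file="Refined Octanol Data From ONS.csv", row.names=FALSE)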

Descriptors

We used Rajarshi Guha's CDK Descriptor Calculator (v1.4.6) to calculate the CDK descriptors for all the compounds in our refined data file, selecting the option to add explicit hydrogens: Solubility Data with CDK Descriptors. Once the descriptors were calculated, we deleted all columns with zero standard deviation, including columns consisting entirely of "NA".

mydata = read.csv(file="20150728InputForFeatureSelection.csv",head=TRUE,row.names="csid")
 
ncol(mydata) #get number of columns
[output]
287
[output]
library(caret) ##for feature selection
 
mydata <- mydata[,colSums(is.na(mydata))<nrow(mydata)] ##remove columns of all entries "NA"
ncol(mydata)
[output]
222
[output]
write.csv(mydata,file="DescriptorsNoNA.csv")
 
 
nzv <- nearZeroVar(mydata) ##identify zero- and near-zero-variance columns
mydata <- mydata[, -nzv]   ##and drop them
ncol(mydata)
[output]
143
[output]
write.csv(mydata,file="DescriptorsNoNZV.csv")

After the above code was executed, two rows (compounds) that still contained NA entries were also deleted, since the correlation matrix used in the feature-selection step below requires complete data.
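
A sketch of that row-level cleaning (the cor() call in the next step requires complete data):

mydata = read.csv(file="DescriptorsNoNZV.csv",header=TRUE,row.names="csid")
mydata <- mydata[complete.cases(mydata), ] ## drop the two compounds with remaining NA entries
write.csv(mydata,file="DescriptorsNoNZV.csv")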

Feature Selection

Feature selection was performed using the following R code that removed highly correlated descriptors:

mydata = read.csv(file="DescriptorsNoNZV.csv",head=TRUE,row.names="csid")
descrCorr <- cor(mydata);
highCorr <- findCorrelation(descrCorr, 0.90, verbose=T);
##remove highly correlated descriptors
mydata <- mydata[, -highCorr];
ncol(mydata)
[output]
87
[output]
write.csv(mydata,file="DescriptorsFeatureSelected.csv")

Description of Data

The data that we will use to create our random forest (RF) models now consists of 259 1-octanol solubility values with 86 descriptors; see the histograms below:

20150728InputHistogram.png
Histogram of Solubility Data After Curation and Feature Selection
20150728InputHistogramMW.png
Histogram of Solubility Data After Curation and Feature Selection by Molecular Weight
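
These histograms can be reproduced with base R graphics; a sketch, assuming a molecular-weight column named MW (if MW was removed during feature selection, read it from the pre-selection descriptor file instead):

mydata = read.csv(file="DescriptorsFeatureSelected.csv",header=TRUE,row.names="csid")
hist(mydata$LogS, breaks=20, xlab="LogS (1-octanol)",
     main="Solubility After Curation and Feature Selection")
hist(mydata$MW, breaks=20, xlab="Molecular Weight",
     main="Distribution of Molecular Weight")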

Principal Component & Cluster Analysis

Cluster Analysis
## determining the best number of clusters
library(fpc)
library(cluster)
mydata = read.csv(file="DescriptorsFeatureSelected.csv",head=TRUE,row.names="csid")
asw <- numeric(20)
for (k in 2:20)
  asw[k] <- pam(mydata, k)$silinfo$avg.width
asw
[output]
[1] 0.0000000 0.7352926 0.4876591 0.3789633 0.4052174 0.4129726 0.4262371 0.4391316 0.4286310 0.3785256 0.3851924
[12] 0.3942103 0.4035164 0.4044112 0.3437240 0.3447198 0.3432463 0.3398725 0.3454049 0.3415452
[output]
k.best <- which.max(asw)
cat("silhouette-optimal number of clusters:", k.best, "\n")
# 2
 
##Cluster Analysis
mydata = read.csv(file="DescriptorsFeatureSelected.csv",head=TRUE,row.names="csid")
mydata <- scale(mydata) # standardize variables
# K-Means Cluster Analysis
fit <- kmeans(mydata, 2) # 2 cluster solution
# get cluster means
aggregate(mydata,by=list(fit$cluster),FUN=mean)
[output]
Too Much To Display
[output]
# append cluster assignment
mydata <- data.frame(mydata, fit$cluster)
write.csv(mydata, file = "clusters.csv")

Principal Component Analysis
##PCA
mydata = read.csv(file="DescriptorsFeatureSelected.csv",head=TRUE,row.names="csid")
pc1 <- prcomp(mydata, scale. = T)
x <- pc1$x
summary(pc1)
 
[output]
Importance of components:
                          PC1    PC2    PC3    PC4     PC5     PC6     PC7     PC8     PC9    PC10    PC11    PC12
Standard deviation     4.0922 3.8658 3.2466 2.5916 2.32777 1.99701 1.72687 1.50001 1.43944 1.43011 1.33050 1.29421
Proportion of Variance 0.1925 0.1718 0.1212 0.0772 0.06228 0.04584 0.03428 0.02586 0.02382 0.02351 0.02035 0.01925
Cumulative Proportion  0.1925 0.3643 0.4854 0.5626 0.62490 0.67074 0.70501 0.73088 0.75469 0.77820 0.79855 0.81780
[output]
 
write.csv(x, file = "PCA.csv")

The following image depicts the chemical space of the data that will be used to create the random forest model. We used the first two principal components, which explain 36% of the variation, and clustered the data into two groups following the recommendation of the pam silhouette analysis in R. Using the DMax Chemistry Assistant, we see that the two groups are separated by nHBAcc and TopoPSA: the red cluster corresponds to compounds with zero hydrogen-bond acceptors and a topological polar surface area below 26.48 Å². The red cluster is typified by compounds without hydrogen-bond acceptors, with ALogP > 1.56, and with TopoPSA < 26.48 Å²; 128 of its 157 compounds match these criteria. The blue cluster is more chemically diverse, but even so 75 of its 102 compounds have ALogP < 1.56, TopoPSA > 26.48 Å², and at least one hydrogen-bond acceptor.
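
These counts can be checked against the saved cluster assignments; a sketch, assuming the descriptor names above and the fit.cluster column written to clusters.csv (the 1/2 labels from kmeans are arbitrary, so verify which one corresponds to the red cluster):

desc <- read.csv(file="DescriptorsFeatureSelected.csv",header=TRUE,row.names="csid")
clus <- read.csv(file="clusters.csv",header=TRUE,row.names=1)
red <- desc[clus$fit.cluster == 1, ] ## unscaled descriptors for one cluster
sum(red$nHBAcc == 0 & red$ALogP > 1.56 & red$TopoPSA < 26.48) ## compounds matching the criteria
nrow(red) ## cluster size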

Overall, the dataset consists of small (typically 100 < MW < 450), druglike compounds (50 compounds have one Lipinski failure) that are solid at room temperature, with 1-octanol solubilities typically between 0.01 M and 1 M. Looking at the nAcid descriptor, we see that 49 compounds are carboxylic acids.
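
A sketch of how such counts can be read off the descriptor matrix, assuming the nAcid column survives feature selection:

desc <- read.csv(file="DescriptorsFeatureSelected.csv",header=TRUE,row.names="csid")
sum(desc$nAcid > 0)                   ## compounds with an acidic group (49)
sum(desc$LogS >= -2 & desc$LogS <= 0) ## compounds between 0.01 M and 1 M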

ChemicalSpaceClusterAnalysis.png

Chemical Space - Clustered Using R
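
A sketch of how this chemical-space plot can be drawn from the PCA scores and cluster assignments computed above:

x    <- read.csv(file="PCA.csv",header=TRUE,row.names=1)      ## principal-component scores
clus <- read.csv(file="clusters.csv",header=TRUE,row.names=1) ## k-means assignments
plot(x$PC1, x$PC2, pch=19,
     col=ifelse(clus$fit.cluster == 1, "red", "blue"),
     xlab="PC1 (19% of variance)", ylab="PC2 (17% of variance)",
     main="Chemical Space - Clustered Using R")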

Modeling

##Modeling
library("randomForest")
setwd("C:/Users/alang/Dropbox/research/OctanolSolubility/AddressingReviewersComments")
mydata = read.csv(file="DescriptorsFeatureSelected.csv",head=TRUE,row.names="csid")
 
 
## 75% of the sample size
smp_size <- floor(0.75 * nrow(mydata))
 
## set the seed to make the partition reproducible
set.seed(123)
train_ind <- sample(seq_len(nrow(mydata)), size = smp_size)
 
train <- mydata[train_ind, ]
test <- mydata[-train_ind, ]
 
mydata.rf <- randomForest(LogS ~ ., data = train, importance = TRUE)
print(mydata.rf)
[output]
Call:
 randomForest(formula = LogS ~ ., data = train, importance = TRUE)
               Type of random forest: regression
                     Number of trees: 500
No. of variables tried at each split: 28
 
          Mean of squared residuals: 0.3800402
                    % Var explained: 62.66
[output]
saveRDS(mydata.rf, file = "OctanolTrainingSet")
 
training.predict <- predict(mydata.rf,train)
write.csv(training.predict, file = "RFTrainingSetPredict.csv")
 
test.predict <- predict(mydata.rf,test)
write.csv(test.predict, file = "RFTestSetPredict.csv")

20150728PredictedVSMeasuredTest.png

Predicted vs Measured 1-octanol solubility for the test-set

The test-set statistics were: AAE = 0.53, MSE = 0.44, R2 = 0.54.
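
These statistics can be computed directly from the held-out predictions; a minimal sketch (R2 here is the coefficient of determination, 1 - SSE/SST; a squared Pearson correlation is an alternative convention):

err <- test$LogS - predict(mydata.rf, test)
AAE <- mean(abs(err))                                      ## average absolute error
MSE <- mean(err^2)                                         ## mean squared error
R2  <- 1 - sum(err^2)/sum((test$LogS - mean(test$LogS))^2) ## coefficient of determination
round(c(AAE=AAE, MSE=MSE, R2=R2), 2)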

Create RF model using all data.
library("randomForest")
setwd("C:/Users/alang/Dropbox/research/OctanolSolubility/AddressingReviewersComments")
mydata = read.csv(file="DescriptorsFeatureSelected.csv",head=TRUE,row.names="csid")
mydata.rf <- randomForest(LogS ~ ., data = mydata, importance = TRUE)
print(mydata.rf)
[output]
Call:
 randomForest(formula = LogS ~ ., data = mydata, importance = TRUE)
               Type of random forest: regression
                     Number of trees: 500
No. of variables tried at each split: 28
 
          Mean of squared residuals: 0.3435526
                    % Var explained: 65.7
[output]
varImpPlot(mydata.rf,main="Random Forest Variable Importance")
saveRDS(mydata.rf, file = "Octanol1")
 
training.predict <- predict(mydata.rf,mydata)
write.csv(training.predict, file = "FullRFTrainingSetPredict.csv")

The most important descriptors (ALogP, XLogP, TopoPSA, nAtomP, MDEC.23, khs.aaCH, nHBAcc) can be seen in the following plot:

20150728RFVARIMP.png
Variable Importance Plot

Note: the correlation between the ALogP and XLogP descriptors is 0.83. If all descriptors with pairwise correlations >= 0.83 had been removed, only 70 descriptors would remain instead of 87.
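
A one-line check of that correlation:

mydata = read.csv(file="DescriptorsFeatureSelected.csv",header=TRUE,row.names="csid")
cor(mydata$ALogP, mydata$XLogP) ## ~0.83, just under the 0.90 cutoff used in feature selection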

Results

20150728PCAWithAE.png
PCA Colored by AE

The following results are an improvement over the model by Abraham and Acree [1], who report a training-set R2 of 0.83.

RF Out of Bag (Training Set)
Mean Square Error: 0.38
R2: 0.63

Test-Set
Mean Square Error: 0.44
R2: 0.54

Full RF Out of Bag
Mean Square Error: 0.34
R2: 0.66

Full RF Training-Set
Mean Square Error: 0.06
R2: 0.94

Performance Issues

We used DMax to find where the model performs well and where it does not. The only significant factor is LogS itself, which may reflect the quality of the data at low solubilities. When the MSE values are calculated separately for compounds above and below 0.01 M (LogS = -2), we find an MSE of 0.13 for compounds with LogS <= -2 and an MSE of 0.05 for compounds with LogS > -2. Distance in principal-component space shows no relationship with error, unlike what we saw previously for Abraham solvent coefficients:
http://journal.chemistrycentral.com/content/9/1/12
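
A sketch of the split-MSE calculation, assuming the full-data model and its training-set predictions from the modeling step:

mydata = read.csv(file="DescriptorsFeatureSelected.csv",header=TRUE,row.names="csid")
pred <- predict(mydata.rf, mydata)         ## full-data RF from above
low <- mydata$LogS <= -2                   ## at or below 0.01 M
mean((mydata$LogS[low] - pred[low])^2)     ## MSE, low-solubility compounds
mean((mydata$LogS[!low] - pred[!low])^2)   ## MSE, high-solubility compounds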


20150728AEvsMeasured.png
Absolute Error vs Measured 1-octanol Solubility

References

1. Abraham, M.H., Acree, W.E., 2014. The solubility of liquid and solid compounds in dry octan-1-ol. Chemosphere 103, 26–34. doi:10.1016/j.chemosphere.2013.10.095
2. Raevsky, O.A., Perlovich, G.L., Schaper, K.-J., 2007. Physicochemical properties/descriptors governing the solubility and partitioning of chemicals in water–solvent–gas systems. Part 2. Solubility in 1-octanol. SAR QSAR Environ. Res. 18, 543–578. doi:10.1080/10629360701430124