AD004

A CDK-based random forest model for Abraham descriptors.
Researchers: Andrew SID Lang

Introduction
The Abraham general solvation models can used to predict partition coefficients via the linear fre energy relationships:
log P = c + e*E + s*S + a*A + b*B + v*V
log K = c + e*E + s*S + a*A + b*B + l*L
where c, e, s, a, b, v, and l are the solvent coefficients and E, S, A, B, V, and L and the solute descriptors.
Here we use Open Data to build random forest models that can be used to predict E, S, A, B, V, and L.

Procedure
We took the latest available data of compounds with known Abraham descriptors [1] and combined it with compounds with Abraham descriptors calculated on-the-fly from ONSChallenge solubility data [2]. This resulted in a dataset (xlsx) of 3400 rows. CDK descriptors were then calculated using Rajarshi Guha's CDK Desc Calc Gui (v1.3.9). In order to calculate the descriptors, the SMILES of the following compounds had to be fixed due to incorrect aromaticity: 1,2,4-Triazole, 3-nitro-1,2,4-triazole, 2-methylbenzimidazole, 2-ethylanthraquinone, 2-methylimidazole, 4-methylimidazole, 3-methylindole, lansoprazole, benzotriazole, thiabendazole, pyrazole, amitrole, cimetidine, and valsartan. Once the SMILES were fixed, the CDK Descriptor Calculator still failed (not unsurprisingly) to produce descriptors for the following compounds: dihydrooxaphospha phenanthrene oxide, hexamethylhexahydrocyclohexasiloxane, pentamethylpentahydrocyclopentasiloxane, bis(trimethylsiloxy)methylsilane, tris(trimethylsiloxy)silane, phenylsilane, 1,1,3,3-tetramethyldisiloxane, hexamethyldisilazane, and tetraethyllead.

R was used to remove rows and columns with all NAs
library("caret")
mydata <- read.csv(file="descriptors.csv",head=TRUE,row.names="molID")
ncol(mydata) ## 274 columns
mydata <- mydata[,colSums(is.na(mydata))<nrow(mydata)]
ncol(mydata) ## 209 columns
write.csv(mydata,file="descriptorsNoNA.csv")

The columns corresponding to the descriptors for HybRatio and Kier3 were then deleted due to a significant number of compounds having NA values (22 and 51 respectively). Several compounds (hydrogen selenide, germanium tetrachloride, tetramethyltin, tetraethyltin, arsine) were then removed due to NAs in the columns for the descriptors BCUTw.1l, BCUTw.1h, BCUTc.1l, BCUTc.1h, and BCUTp.1l. Similarly, the following compounds were removed because they had an NA in the topoShape descriptor: water, ammonia, methane, bromine, iodine, phosphine, hydrogen sulfide, and chlorine.

Highly correlated descriptors were then removed using the following code:
library("caret")
mydata <- read.csv(file="DescriptorsreadyForFeatureSelection.csv",head=TRUE,row.names="molID")
ncol(mydata) ## 274 columns
## remove all columns with near zero variance
mydata.nzv <- nearZeroVar(mydata)
mydata <- mydata[, -mydata.nzv]
ncol(mydata) ## 119 cols
descrCorr <- cor(mydata);
highCorr <- findCorrelation(descrCorr, 0.90, verbose=T);
## remove highly correlated descriptors
mydata <- mydata[, -highCorr];
ncol(mydata) ## 76 columns
write.csv(mydata,file="DescriptorsFeatureSelected.csv")

After curation, we were left with a dataset ready for modeling (xlsx: 3376 measurements, 76 descriptors). The data was then split into separate files for corresponding to each Abraham descriptor: E, S, A, B, V, and L (csv), see Table 1. In all files, cells with "-123" correspond to no measurement for that particular descriptor. For duplicate compounds (by CSID), only the most recent (by sortdate) values were kept.
Table 1. Descriptive statistics.
Descriptor
Number (N)
Min Value
Max Value
Mean Value
SD
E
2217
-1.137
5.691
0.915
0.788
S
2287
-5.567
9.105
1.029
0.733
A
2287
0.000
9.368
0.182
0.382
B
2274
0.000
5.459
0.513
0.505
V
2287
0.2026
6.863
1.336
0.683
L
2011
-0.836
28.94
6.084
3.128
Random forest models were then creates for each descriptor using R. For example, the random forest model for E was created using the following code.
library("randomForest")
mydata <- read.csv(file="20141007E.csv",head=TRUE,row.names="molID")
mydata.rf <- randomForest(E ~ ., data = mydata,importance = TRUE)
print(mydata.rf)
[output]
Call:
 randomForest(formula = E ~ ., data = mydata, importance = TRUE)
               Type of random forest: regression
                     Number of trees: 500
No. of variables tried at each split: 25
 
          Mean of squared residuals: 0.02583414
                    % Var explained: 95.84
[output]
## get variable importance plot
varImpPlot(mydata.rf,main="Random Forest Variable Impor[[@https://onsmodels.shinyapps.io/OpenPhysProp/|Open PhysProp Shiny App]] tance")
saveRDS(mydata.rf, file = "ADE004")

The most important descriptors (TopoPSA, nHBDon, BCUTw.1h) can be seen on the following plot.
20141007varimpE.png

Results

Random forest models were created for all descriptors. In general, the models perform very well. The model for S could be better and the model for A is poor. This may be due to the fact that A is taken to be zero if the compound has zero hydrogen bond donors. The models can be downloaded by clicking on the link in Table 2 below. The models are included in the Open PhyProp Shiny App.

*Table 2. Summary of Results. Statistics are for OOB.*
Model
RMSE
R2
Important Descriptors
E
0.161
95.84
TopoPSA, nHBDon, BCUTw.1h
S
0.347
77.57
WTPT.5, nAtomP, TopoPSA
A
0.286
43.85
nHBDon, khs.sNH2, WTPT.5
B
0.147
91.58
nHBAcc, TopoPSA, XLogP
V
0.096
98.03
WPATH, fragC, BCUTp.1h
L
0.580
96.56
ATSp4, BCUTp.1l, WPATH

References

1. Bradley, Jean-Claude; Acree, William; Lang, Andrew (2014): Compounds with known Abraham descriptors. figshare: http://dx.doi.org/10.6084/m9.figshare.1176994 . Retrieved 15:48, Oct 17, 2014 (GMT)
2. Bradley, Jean-Claude. Abraham Descriptors from ONSChallenge solubility data. Retrieved 15:48, Oct 17, 2014 (GMT)