AD004

A CDK-based random forest model for Abraham descriptors. Researchers: Andrew SID Lang

The Abraham general solvation models can be used to predict partition coefficients via the linear free energy relationships:

log P = c + e*E + s*S + a*A + b*B + v*V
log K = c + e*E + s*S + a*A + b*B + l*L

where c, e, s, a, b, v, and l are the solvent coefficients and E, S, A, B, V, and L are the solute descriptors. Here we use Open Data to build random forest models that can be used to predict E, S, A, B, V, and L.
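As a quick sanity check of the first relationship, it can be evaluated directly in R. The solvent coefficients and solute descriptors below are made-up values for illustration only, not fitted ones:

```r
# Hypothetical solvent coefficients (illustration only; not fitted values)
cc <- 0.1; e <- 0.5; s <- -1.0; a <- 0.0; b <- -3.5; v <- 3.8
# Hypothetical solute descriptors for some compound
E <- 0.80; S <- 0.60; A <- 0.00; B <- 0.45; V <- 0.92

# log P = c + e*E + s*S + a*A + b*B + v*V
logP <- cc + e*E + s*S + a*A + b*B + v*V
logP  # 1.821
```

(The coefficient `c` is named `cc` here only to avoid masking R's base `c()` function.)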
**Introduction**

We took the latest available data of compounds with known Abraham descriptors [1] and combined it with compounds whose Abraham descriptors were calculated on-the-fly from ONSChallenge solubility data [2]. This resulted in a dataset (xlsx) of 3400 rows. CDK descriptors were then calculated using Rajarshi Guha's CDK Descriptor Calculator GUI (v1.3.9). In order to calculate the descriptors, the SMILES of the following compounds had to be fixed due to incorrect aromaticity: 1,2,4-triazole, 3-nitro-1,2,4-triazole, 2-methylbenzimidazole, 2-ethylanthraquinone, 2-methylimidazole, 4-methylimidazole, 3-methylindole, lansoprazole, benzotriazole, thiabendazole, pyrazole, amitrole, cimetidine, and valsartan. Even with the fixed SMILES, the CDK Descriptor Calculator still failed (not surprisingly) to produce descriptors for the following compounds: dihydrooxaphospha phenanthrene oxide, hexamethylhexahydrocyclohexasiloxane, pentamethylpentahydrocyclopentasiloxane, bis(trimethylsiloxy)methylsilane, tris(trimethylsiloxy)silane, phenylsilane, 1,1,3,3-tetramethyldisiloxane, hexamethyldisilazane, and tetraethyllead.
**Procedure**

R was used to remove columns consisting entirely of NAs:

code
library("caret")
mydata <- read.csv(file="descriptors.csv", head=TRUE, row.names="molID")
ncol(mydata) ## 274 columns
mydata <- mydata[, colSums(is.na(mydata)) < nrow(mydata)]
ncol(mydata) ## 209 columns
write.csv(mydata, file="descriptorsNoNA.csv")
code
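The column filter above can be checked on a toy data frame (made-up data): only columns that are entirely NA get dropped, while partially-NA columns are kept.

```r
# Toy data: column "b" is all NA; column "c" has some NAs but not all
toy <- data.frame(a = 1:4, b = NA, c = c(1, NA, 3, 4))
# Keep columns whose NA count is less than the number of rows
toy <- toy[, colSums(is.na(toy)) < nrow(toy)]
names(toy)  # "a" "c"
```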

The columns corresponding to the descriptors for HybRatio and Kier3 were then deleted due to a significant number of compounds having NA values (22 and 51 respectively). Several compounds (hydrogen selenide, germanium tetrachloride, tetramethyltin, tetraethyltin, arsine) were then removed due to NAs in the columns for the descriptors BCUTw.1l, BCUTw.1h, BCUTc.1l, BCUTc.1h, and BCUTp.1l. Similarly, the following compounds were removed because they had an NA in the topoShape descriptor: water, ammonia, methane, bromine, iodine, phosphine, hydrogen sulfide, and chlorine.
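Row removals of this kind can be sketched in base R with `complete.cases`, assuming the relevant descriptor columns are present (toy data and values below are for illustration only):

```r
# Toy data with NAs in descriptor columns for some compounds
toy <- data.frame(molID = c("water", "ethanol", "arsine"),
                  topoShape = c(NA, 1.2, 0.9),
                  BCUTw.1l = c(2.1, 2.3, NA),
                  stringsAsFactors = FALSE)
# Drop rows with an NA in any of the checked descriptor columns
toy <- toy[complete.cases(toy[, c("topoShape", "BCUTw.1l")]), ]
toy$molID  # "ethanol"
```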

Highly correlated descriptors were then removed using the following code:

code
library("caret")
mydata <- read.csv(file="DescriptorsreadyForFeatureSelection.csv", head=TRUE, row.names="molID")
ncol(mydata) ## 274 columns
## 1) remove all columns with near zero variance
mydata.nzv <- nearZeroVar(mydata)
mydata <- mydata[, -mydata.nzv]
ncol(mydata) ## 119 columns
## 2) remove highly correlated descriptors
descrCorr <- cor(mydata)
highCorr <- findCorrelation(descrCorr, 0.90, verbose=T)
mydata <- mydata[, -highCorr]
ncol(mydata) ## 76 columns
write.csv(mydata, file="DescriptorsFeatureSelected.csv")
code
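For readers without caret, the |r| > 0.90 correlation filter can be approximated in a few lines of base R. This is a sketch on toy data, greedily removing one column from each highly correlated pair:

```r
set.seed(42)
x <- rnorm(100)
toy <- data.frame(d1 = x,
                  d2 = x + rnorm(100, sd = 0.01),  # near-duplicate of d1
                  d3 = rnorm(100))                 # independent descriptor
cm <- abs(cor(toy))
cm[upper.tri(cm, diag = TRUE)] <- 0   # consider each pair only once
drop <- names(which(apply(cm, 2, max) > 0.90))
filtered <- toy[, setdiff(names(toy), drop)]
names(filtered)  # "d2" "d3"
```

Note that caret's `findCorrelation` is more careful (it removes the member of each pair with the larger mean absolute correlation), so its selections can differ from this sketch.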

After curation, we were left with a dataset (xlsx: 3376 measurements, 76 descriptors). The data was then split into separate files (csv), one for each Abraham descriptor; see Table 1. In all files, cells with "-123" indicate no measurement for that particular descriptor. For duplicate compounds (by CSID), only the most recent values (by sortdate) were kept.

**Table 1. Descriptive statistics.**
|| Descriptor || Number (N) || Min Value || Max Value || Mean Value || SD ||
|| E || 2217 || -1.137 || 5.691 || 0.915 || 0.788 ||
|| S || 2287 || -5.567 || 9.105 || 1.029 || 0.733 ||
|| A || 2287 || 0.000 || 9.368 || 0.182 || 0.382 ||
|| B || 2274 || 0.000 || 5.459 || 0.513 || 0.505 ||
|| V || 2287 || 0.2026 || 6.863 || 1.336 || 0.683 ||
|| L || 2011 || -0.836 || 28.94 || 6.084 || 3.128 ||

Random forest models were then created for each descriptor using R. For example, the random forest model for E was created using the following code:

code
library("randomForest")
mydata <- read.csv(file="20141007E.csv", head=TRUE, row.names="molID")
mydata.rf <- randomForest(E ~ ., data = mydata, importance = TRUE)
print(mydata.rf)
# get variable importance plot
varImpPlot(mydata.rf, main="Random Forest Variable Importance")
saveRDS(mydata.rf, file = "ADE004")
code

[output]
Call:
 randomForest(formula = E ~ ., data = mydata, importance = TRUE)
               Type of random forest: regression
                     Number of trees: 500
No. of variables tried at each split: 25

          Mean of squared residuals: 0.02583414
                    % Var explained: 95.84
[output]
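The OOB RMSE values reported in Table 2 below follow directly from the "Mean of squared residuals" printed by randomForest; for the E model:

```r
mse <- 0.02583414   # mean of squared residuals from the model output above
rmse <- sqrt(mse)
round(rmse, 3)      # 0.161, the value reported for E in Table 2
```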

The most important descriptors (TopoPSA, nHBDon, BCUTw.1h) can be seen in the resulting variable importance plot.

**Results**
Random forest models were created for all descriptors. In general, the models perform very well. The model for S could be better, and the model for A is poor. This may be because A is taken to be zero if the compound has zero hydrogen bond donors. The models can be downloaded by clicking on the links in Table 2 below. The models are also included in the Open PhysProp Shiny App.
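Because A is taken to be zero for compounds with no hydrogen bond donors, one possible post-processing step (a sketch, not part of the published models) is to clamp predicted A values to zero whenever the nHBDon descriptor is zero:

```r
# Hypothetical predicted A values and hydrogen-bond donor counts
predA  <- c(0.12, 0.03, 0.45)
nHBDon <- c(1, 0, 2)
predA[nHBDon == 0] <- 0   # enforce A = 0 when there are no HB donors
predA  # 0.12 0.00 0.45
```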


**Table 2. Summary of Results. Statistics are for OOB (out-of-bag) predictions.**
|| Model || RMSE || R2 (%) || Important Descriptors ||
|| [[file:ADE004|E]] || 0.161 || 95.84 || TopoPSA, nHBDon, BCUTw.1h ||
|| [[file:ADS004|S]] || 0.347 || 77.57 || WTPT.5, nAtomP, TopoPSA ||
|| [[file:ADA004|A]] || 0.286 || 43.85 || nHBDon, khs.sNH2, WTPT.5 ||
|| [[file:ADB004|B]] || 0.147 || 91.58 || nHBAcc, TopoPSA, XLogP ||
|| [[file:ADV004|V]] || 0.096 || 98.03 || WPATH, fragC, BCUTp.1h ||
|| [[file:ADL004|L]] || 0.580 || 96.56 || ATSp4, BCUTp.1l, WPATH ||