MeltingPointModel011

Modeling Melting Points Using Open Babel

 * Researchers: Jean-Claude Bradley, Andrew Lang, Tony Williams, //et al.//


 * [This page is a duplicate (backup) of the original model page on the ONSChallenge wiki]

Introduction
Fragment-based models are, in general, less accurate than more sophisticated nonlinear models such as random forests. However, because of their speed, fragment-based models are useful for approximating properties across large compound libraries. They are also smaller, easier to deploy, and more chemically interpretable than nonlinear models. Inspired by a blog post by Open Babel developer Noel O'Boyle on how to deploy SMARTS fragment-based models using Open Babel, we developed an open (CC0) fragment-based melting point model that can be incorporated into Open Babel. The model achieved R2: 0.61, AAE: 43.6 °C, and RMSE: 58.1 °C when used to predict the melting points of a 3500-compound test set. While these results are comparable to a previously published, non-reproducible fragment-based model built on the problematic PHYSPROP database (659-compound test set; R2: 0.62, AAE: 48.9, RMSE: 48.8) [1], they pale in comparison to our current nonlinear models, which achieve an OOB R2: 0.83 and RMSE: 38.1.

Procedure
With the lack of a dedicated open fragment calculator for QSPR modeling, we decided to use Open Babel itself as the modeling platform. We began by taking the union of the fragments from the current fragment-based models deployed in Open Babel (psa, logp, mr). We then used Open Babel to calculate the fragment counts for the same 16015-compound training set used in MPModel007. Because in Open Babel "where several SMARTS strings assign values to the same atom, only the final assignment is retained" [2], fragment counts cannot be read off directly. Instead, we first assigned a value of 1 to all descriptors to get a total count, and then, for each descriptor in turn, set its value to zero (with all the rest set to one). Subtracting the two totals gives the count for that descriptor; repeating this 167 times gives all the needed fragment counts.
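The subtraction trick above can be sketched as follows. This is a minimal Python illustration of the logic only, not actual Open Babel API calls; the toy molecule, its match lists, and the descriptor indices are hypothetical stand-ins for Open Babel's SMARTS matcher.

```python
# Illustrative sketch of the descriptor-isolation trick (hypothetical data,
# not real Open Babel calls). Each atom keeps only the value of the LAST
# descriptor that matches it, mimicking Open Babel's behaviour.

# For each atom of a toy molecule: indices of the descriptors that match it,
# in the order the descriptors are defined (last match wins).
atom_matches = [
    [0],        # atom 0 matched by descriptor 0 only
    [0, 2],     # atom 1: descriptor 2 wins (final assignment retained)
    [1],        # atom 2 matched by descriptor 1
    [2],        # atom 3 matched by descriptor 2
]
n_descriptors = 3

def total_value(values):
    """Sum, over atoms, the value of the last descriptor matching each atom."""
    return sum(values[matches[-1]] for matches in atom_matches if matches)

# Step 1: all descriptor values set to 1 -> total count of matched atoms.
base = total_value([1] * n_descriptors)

# Step 2: zero out one descriptor at a time; the drop in the total is the
# number of atoms whose final assignment came from that descriptor.
counts = []
for i in range(n_descriptors):
    values = [1] * n_descriptors
    values[i] = 0
    counts.append(base - total_value(values))

print(base, counts)  # -> 4 [1, 1, 2]
```

Note that because atom 1 matches two descriptors, descriptor 0's recovered count is lower than its raw match count, which is exactly the effect described in the next paragraph.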

We removed the surprisingly large number of fragments with total counts less than 5 ([#1], [#50], [O][#14], [O][#33], [O][#50], [O]C=[#7], [O]C=S, [*], [#9-*], [#17-*], [#35-*], [#53-*], [#53+*], [Fe,Cu,Zn,Tc,Cd,Pt,Au,Hg], [CH4], [CH2](C)C, [CH](C)(C)C, [CH3][O,N,F,Cl,Br,#15,#16,#53;!a], [CH2X4][O,N,F,Cl,Br,#15,#16,#53;!a], [CHX4][O,N,F,Cl,Br,#15,#16,#53;!a], [CH3]c, [CH3][a!#6], [CH2X4]a, [CHX4]a, [CX4][!#6;!#7;!#8;!#9;!#15;!#16;!#17;!#35;!#53;!#1], [N]#*, [N](=*)#*, [NH]1-*-*1, [N+](-*)(-*)(-*)-*, [N+](-*)(-*)=*, [N+](-*)#*, [NH+](-*)(-*)-*, [NH+](-*)=*, [NH2+](-*)-*, [NH2+]=*, [NH3+]-*, [n](:*):*, [n](:*)(:*):*, [n](-*)(:*):*, [n](=*)(:*):*, [nH](:*):*, [n+](:*)(:*):*, [n+](-*)(:*):*, [nH+](:*):*, [N+0](=a)a, [N+0](a)(a)a, [NH3+*], [NH2+*], [NH+*], [n+*], [NH0+*](=A)(A)a, [NH0+*](=[#6])=[#7], [N+*]#A, [N-*], [N+*](=[N-*])=N, [#8], [O](-*)-*, [O]1-*-*1, [OH]-*, [O-]-*, [o](:*):*, [OH2], [O]=[#8], [OX1-*][#7], [OX1-*][#16], [OX1-*][#15;#33;#43;#53], [O]=[CH2], [O]=[CX2]=O, [O]=C([a!#6])[a!#6], [O-1]C(=O), [#15], [P](-*)=*, [S](-*)-*, [S]=*, [S](-*)(-*)=*, [S](-*)(-*)(=*)=*, [SH]-*, [s](:*):*, [s](=*)(:*):*, [S-*], [S+*]). Note: many of these fragments would individually match more than five times, but because Open Babel retains only the final assignment when multiple SMARTS patterns match the same atom, their effective counts end up below five.
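The low-count filter itself is a simple column-sum threshold over the compound-by-fragment count matrix. A minimal sketch with hypothetical toy counts (the real matrix is 16015 compounds by 167 fragments):

```python
# Sketch of the low-count fragment filter (toy data, not the real matrix):
# drop any fragment whose total count over the training set is below 5.

# Rows = compounds, columns = fragment counts (hypothetical numbers).
fragment_names = ["[CH4]", "[OH]-*", "[#15]", "[n](:*):*"]
count_matrix = [
    [0, 2, 1, 3],
    [1, 4, 0, 2],
    [0, 3, 0, 5],
]

MIN_TOTAL = 5  # fragments rarer than this carry too little signal to fit

# Total occurrences of each fragment across all compounds.
totals = [sum(row[j] for row in count_matrix) for j in range(len(fragment_names))]

# Keep only the fragments that clear the threshold.
kept = [name for name, total in zip(fragment_names, totals) if total >= MIN_TOTAL]

print(totals, kept)  # -> [1, 9, 1, 10] ['[OH]-*', '[n](:*):*']
```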

We removed the four correlated descriptors [C]=c, [O]=[#7], [O]=c, and [PH](-*)(-*)=* using the caret package and the following R code: code
# 1) load in data
library(caret)
mydata = read.csv(file="TrainingWithDescriptorsReadyForFeatureSelection.csv", head=TRUE, row.names="molID")
# 2) correlation matrix
cor.mat = cor(mydata)
# 3) find correlated features (r > 0.95)
findCorrelation(cor.mat, cutoff = .95, verbose = TRUE)
[output]
[1] 74 75 50 84
[output] code

Again using R, we developed a simple linear model and performed further feature selection to obtain the fragment coefficients. code
# 1) load in data
mydata = read.csv(file="TrainingWithDescriptorsReadyForLinearRegression.csv", head=TRUE, row.names="molID")
# 2) fit a zero-intercept linear model on the surviving fragment counts
fit <- lm(mpC ~ 0 + tf1 + tf2 + tf4 + tf5 + tf6 + tf7 + tf9 + tf10 + tf11 + tf12 + tf14 + tf17 + tf18 + tf20 + tf22 + tf23 + tf25 + tf26 + tf27 + tf28 + tf34 + tf36 + tf40 + tf44 + tf45 + tf46 + tf47 + tf48 + tf49 + tf50 + tf55 + tf56 + tf57 + tf58 + tf59 + tf60 + tf61 + tf62 + tf63 + tf64 + tf65 + tf66 + tf67 + tf68 + tf69 + tf70 + tf71 + tf72 + tf73 + tf76 + tf77 + tf78 + tf79 + tf81 + tf83 + tf84 + tf86 + tf87 + tf104 + tf105 + tf106 + tf109 + tf113 + tf115 + tf124 + tf128 + tf129 + tf131 + tf132 + tf133 + tf134 + tf135 + tf142 + tf143 + tf144 + tf147 + tf148 + tf149 + tf153 + tf155 + tf164 + tf167, data=mydata)
# 3) step-wise feature selection to remove non-significant features (stepAIC is in the MASS package)
library(MASS)
bestfit <- stepAIC(fit, upper=~., lower=~1, direction="both")
# 4) output results
summary(bestfit)

[output]
Call:
lm(formula = mpC ~ tf2 + tf4 + tf5 + tf9 + tf10 + tf11 + tf12 +
    tf14 + tf17 + tf20 + tf22 + tf23 + tf25 + tf26 + tf28 + tf34 +
    tf36 + tf40 + tf44 + tf45 + tf46 + tf47 + tf48 + tf50 + tf55 +
    tf56 + tf57 + tf58 + tf59 + tf60 + tf61 + tf62 + tf63 + tf64 +
    tf65 + tf66 + tf67 + tf68 + tf69 + tf70 + tf71 + tf72 + tf73 +
    tf76 + tf78 + tf79 + tf81 + tf84 + tf86 + tf87 + tf104 +
    tf106 + tf109 + tf113 + tf115 + tf124 + tf131 + tf132 + tf144 +
    tf147 + tf148 + tf155 + tf164 - 1, data = mydata)

Residuals:
     Min       1Q   Median       3Q      Max
 -784.85   -32.22     0.41    35.55   351.06

Coefficients:
       Estimate Std. Error t value Pr(>|t|)
tf2    -11.2891     0.3411 -33.094  < 2e-16 ***
tf4     -8.0352     5.1829  -1.550 0.121079
tf5    -36.0534     5.0458  -7.145 9.37e-13 ***
tf9     32.1543     2.2267  14.440  < 2e-16 ***
tf10     7.3289     0.8070   9.082  < 2e-16 ***
tf11    22.9132     1.3487  16.989  < 2e-16 ***
tf12    48.0277     2.6471  18.143  < 2e-16 ***
tf14    26.5074     4.4937   5.899 3.74e-09 ***
tf17    42.6169     4.3065   9.896  < 2e-16 ***
tf20    44.4787     1.3334  33.356  < 2e-16 ***
tf22    24.8303    13.6538   1.819 0.068998 .
tf23    40.0567     7.7142   5.193 2.10e-07 ***
tf25   -15.2212     0.5951 -25.577  < 2e-16 ***
tf26    -8.2654     0.9197  -8.987  < 2e-16 ***
tf28    14.7222     4.9566   2.970 0.002980 **
tf34    17.7551     2.5995   6.830 8.79e-12 ***
tf36    34.2188     8.6157   3.972 7.17e-05 ***
tf40    40.6006     1.1303  35.920  < 2e-16 ***
tf44    32.9369     1.2416  26.527  < 2e-16 ***
tf45    20.7267     0.7192  28.818  < 2e-16 ***
tf46   -14.8687     2.5568  -5.815 6.17e-09 ***
tf47    15.5201     0.8502  18.255  < 2e-16 ***
tf48    18.9400     0.9590  19.750  < 2e-16 ***
tf50     8.0103     1.6218   4.939 7.93e-07 ***
tf55    28.8598     1.7874  16.147  < 2e-16 ***
tf56    27.9572     2.0982  13.325  < 2e-16 ***
tf57    19.6289     1.1059  17.750  < 2e-16 ***
tf58    25.3437     1.1923  21.257  < 2e-16 ***
tf59    26.4790     1.1996  22.073  < 2e-16 ***
tf60    10.0405     5.5401   1.812 0.069954 .
tf61    11.6743     0.3470  33.641  < 2e-16 ***
tf62    22.2839     0.6271  35.536  < 2e-16 ***
tf63    18.3672     0.8946  20.530  < 2e-16 ***
tf64    17.1777     0.5595  30.702  < 2e-16 ***
tf65    20.7165     1.1950  17.335  < 2e-16 ***
tf66    18.4476     0.7396  24.942  < 2e-16 ***
tf67    17.1867     2.1395   8.033 1.02e-15 ***
tf68    31.1712     6.9949   4.456 8.40e-06 ***
tf69    43.4894     9.0487   4.806 1.55e-06 ***
tf70    35.4438     2.0479  17.307  < 2e-16 ***
tf71    18.5955     3.2410   5.738 9.78e-09 ***
tf72    16.8983     8.4073   2.010 0.044454 *
tf73    26.0819     1.6811  15.515  < 2e-16 ***
tf76    21.9141     0.7073  30.982  < 2e-16 ***
tf78    13.9251     1.2672  10.989  < 2e-16 ***
tf79     7.7621     1.4669   5.291 1.23e-07 ***
tf81    19.4670     1.5937  12.215  < 2e-16 ***
tf84    -4.9750     2.4064  -2.067 0.038711 *
tf86   -22.2462     7.8444  -2.836 0.004575 **
tf87   -21.3490     4.7744  -4.472 7.82e-06 ***
tf104  -34.4811     4.7166  -7.311 2.79e-13 ***
tf106  -32.8638    16.7665  -1.960 0.050003 .
tf109   13.9377     2.2851   6.099 1.09e-09 ***
tf113   15.8908     0.6518  24.380  < 2e-16 ***
tf115   96.6851    11.5045   8.404  < 2e-16 ***
tf124   10.3334     0.8373  12.341  < 2e-16 ***
tf131   -8.9495     0.7989 -11.202  < 2e-16 ***
tf132   -6.6031     1.9248  -3.431 0.000604 ***
tf144  -27.6702    14.3434  -1.929 0.053732 .
tf147    8.5440     3.4518   2.475 0.013324 *
tf148    4.8169     2.8596   1.684 0.092105 .
tf155   11.6943     4.6059   2.539 0.011127 *
tf164    8.9702     1.5402   5.824 5.86e-09 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 57.25 on 15952 degrees of freedom
Multiple R-squared: 0.836,  Adjusted R-squared: 0.8353
F-statistic: 1290 on 63 and 15952 DF,  p-value: < 2.2e-16
[output] code
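Once fitted, the model is simply a zero-intercept dot product of a compound's fragment counts with the coefficients above. A minimal Python sketch, using three coefficients from the fitted model and hypothetical toy fragment counts:

```python
# Excerpt of fitted coefficients (deg C contribution per fragment occurrence),
# taken from the regression summary above.
coefficients = {"tf2": -11.2891, "tf20": 44.4787, "tf61": 11.6743}

def predict_mp(fragment_counts, coefs):
    """Zero-intercept linear model: sum of count * coefficient per fragment."""
    return sum(count * coefs.get(frag, 0.0) for frag, count in fragment_counts.items())

# Hypothetical compound with 3 matches of tf61 and 1 match of tf20.
counts = {"tf61": 3, "tf20": 1}
print(round(predict_mp(counts, coefficients), 2))  # -> 79.5
```

This is the same arithmetic the deployed Open Babel plugin performs via its SMARTS-based fragment counting.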

Following the instructions on the Open Babel plugin development page, we deployed the melting point model plugin by creating the appropriate text file and updating the plugindefines.txt file in the data folder. We then used Open Babel to calculate the melting points of the 3500-compound test set, obtaining R2: 0.61, AAE: 43.6 °C, and RMSE: 58.1 °C, illustrated below.
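The three test-set statistics reported here can be computed from paired observed and predicted melting points as follows. This is a minimal sketch with hypothetical toy values standing in for the real 3500-compound predictions (R2 here is the standard coefficient of determination; definitions vary slightly across the literature):

```python
import math

def fit_statistics(observed, predicted):
    """Return (R2, AAE, RMSE) for paired observed/predicted values."""
    n = len(observed)
    mean_obs = sum(observed) / n
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    r2 = 1 - ss_res / ss_tot                                         # coefficient of determination
    aae = sum(abs(o - p) for o, p in zip(observed, predicted)) / n   # average absolute error
    rmse = math.sqrt(ss_res / n)                                     # root-mean-square error
    return r2, aae, rmse

# Toy melting points in deg C (hypothetical, not the actual test set).
observed = [10.0, 55.0, 120.0, 80.0, -5.0]
predicted = [20.0, 60.0, 100.0, 90.0, 0.0]

r2, aae, rmse = fit_statistics(observed, predicted)
print(round(r2, 3), round(aae, 1), round(rmse, 1))  # -> 0.938 10.0 11.4
```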

Results
We successfully created an open fragment-based model that can easily be added to Open Babel, using R to build the linear model and Open Babel as the fragment calculator for an open data set of melting points. The test-set statistics (R2: 0.61, AAE: 43.6 °C, RMSE: 58.1 °C) can likely be improved by further curation of the open melting point data (or by using our less diverse but higher-quality doubleplusgood dataset), by introducing new fragments, by optimizing the fragment order, and by applying more sophisticated feature selection and linear modeling techniques than those used here.