MeltingPointModel013

More modeling of melting points using Open Babel

 * Researcher: Andrew SID Lang**

Introduction
The goal of this study is to create an open (CC0) fragment-based melting point model that can be deployed into Open Babel. This model will replace our previous model, which used a limited set of fragments, that has performance statistics of R2: 0.61, AAE: 43.6, and RMSE: 58.1 on the 3500 compound test-set.

Procedure
To make models easily comparable, we use here the same melting point dataset as used in our previous models (MPModel007, @MeltingPointModel011), keeping the exact same training-set and test-set as before too.

We converted the SMILES from the same training set into an sdf file which we then ran through SMF v3.03 to calculate fragments, both atom/bond sequences (lengths: 2-5). This gave us a training-set of 16,015 compounds with 2568 sequence descriptors.

We removed strongly correlated descriptors by using the following code in R (uses the caret package) code setwd("C:/MeltingPointModel013")
 * 1) set the working directory

library(caret)
 * 1) load the caret package

mydata = read.csv(file="TrainingSetWithSequenceFragments.csv",head=TRUE,row.names="molID")
 * 1) load in data

ncol(mydata) [output] 2569 [output]
 * 1) get the number of columns

nzv <-nearZeroVar(mydata) mydata <- mydata[, -nzv]
 * 1) remove small variance columns

ncol(mydata) [output] 101 [output]

cmatrix <- cor(mydata);
 * 1) create the correlation matrix

highc <- findCorrelation(cmatrix, 0.90, verbose=T); mydata <- mydata[, -highc];
 * 1) remove highly correlated descriptors

ncol(mydata) [output] 74 [output] write.csv(mydata,file="DescriptorsFeatureSelected.csv")

code

Here's the resulting file. Again using R, we developed a simple linear model and performed further feature selection to attain the fragment coefficients.

code mydata = read.csv(file="DescriptorsFeatureSelected.csv", head=TRUE,row.names="molID")
 * 1) load in data

fit <- lm(mpC ~ 0 + ., data=mydata) summary(fit)

[output] Residual standard error: 63.57 on 15942 degrees of freedom Multiple R-squared: 0.7979,    Adjusted R-squared:  0.797 F-statistic: 862.2 on 73 and 15942 DF, p-value: < 2.2e-16 [output]

library(MASS) bestfit <- stepAIC(fit, upper=~., lower=~1, direction="both")
 * 1) step-wise feature selection to remove non-significant features

bestfit$anova
 * 1) display results

[output] Stepwise Model Path Analysis of Deviance Table

Initial Model: mpC ~ 0 + (s3 + s4 + s7 + s10 + s11 + s12 + s13 + s16 + s17 +   s18 + s20 + s21 + s22 + s23 + s24 + s25 + s26 + s27 + s28 +    s29 + s30 + s31 + s34 + s36 + s37 + s38 + s39 + s40 + s41 +    s43 + s44 + s46 + s47 + s48 + s49 + s50 + s79 + s80 + s81 +    s84 + s91 + s92 + s95 + s101 + s108 + s111 + s113 + s115 +    s118 + s120 + s121 + s124 + s126 + s128 + s129 + s130 + s131 +    s133 + s146 + s156 + s164 + s168 + s177 + s184 + s185 + s221 +    s222 + s223 + s228 + s229 + s280 + s285 + s289)

Final Model: mpC ~ s3 + s4 + s7 + s10 + s11 + s12 + s13 + s16 + s17 + s18 + s20 + s21 + s22 + s23 + s24 + s25 + s26 + s27 + s28 + s29 + s30 + s31 + s34 + s36 + s38 + s39 + s40 + s44 + s46 + s47 + s48 + s49 + s50 + s80 + s81 + s84 + s91 + s92 + s101 + s108 + s111 + s113 + s118 + s120 + s121 + s124 + s126 + s128 + s129 + s130 + s131 + s146 + s164 + s168 + s177 + s184 + s185 + s222 + s228 + s229 + s280 + s285 + s289 - 1

Step Df  Deviance Resid. Df Resid. Dev     AIC 1                          15942   64424489 133066.1 2  - s133  1   17.75295     15943   64424507 133064.1 3  - s41  1   20.95081     15944   64424528 133062.1 4 - s223  1   47.27437     15945   64424575 133060.1 5  - s43  1  396.60900     15946   64424972 133058.2 6 - s221  1  447.07669     15947   64425419 133056.3 7 - s156  1  454.47326     15948   64425873 133054.4 8  - s37  1 1362.72231     15949   64427236 133052.8 9  - s95  1 1297.73838     15950   64428534 133051.1 10 - s115 1 4276.68111     15951   64432810 133050.1 11 - s79  1 7519.50646     15952   64440330 133050.0 [output]

summary(bestfit) ## output the final model

[output] Call: lm(formula = mpC ~ s3 + s4 + s7 + s10 + s11 + s12 + s13 + s16 +   s17 + s18 + s20 + s21 + s22 + s23 + s24 + s25 + s26 + s27 +    s28 + s29 + s30 + s31 + s34 + s36 + s38 + s39 + s40 + s44 +    s46 + s47 + s48 + s49 + s50 + s80 + s81 + s84 + s91 + s92 +    s101 + s108 + s111 + s113 + s118 + s120 + s121 + s124 + s126 +    s128 + s129 + s130 + s131 + s146 + s164 + s168 + s177 + s184 +    s185 + s222 + s228 + s229 + s280 + s285 + s289 - 1, data = mydata)

Residuals: Min     1Q  Median      3Q     Max -526.98 -37.40   -1.28   37.25  376.06

Coefficients: Estimate Std. Error t value Pr(>|t|) s3   -1.8498     0.2132  -8.675  < 2e-16 *** s4  -25.2846     4.1205  -6.136 8.64e-10 *** s7    6.2586     0.6926   9.036  < 2e-16 *** s10  26.7114     2.3063  11.582  < 2e-16 *** s11   2.8083     0.6881   4.081 4.50e-05 *** s12  51.9087     2.5122  20.662  < 2e-16 *** s13   8.4622     0.6209  13.628  < 2e-16 *** s16   3.7174     0.9865   3.768 0.000165 *** s17   7.8643     1.9408   4.052 5.10e-05 *** s18   5.8715     0.7096   8.274  < 2e-16 *** s20  16.7968     1.4447  11.627  < 2e-16 *** s21   0.6915     0.4553   1.519 0.128837 s22  -1.4671     0.3901  -3.761 0.000170 *** s23  -2.9295     1.2455  -2.352 0.018676 * s24  -7.0190     0.6391 -10.983  < 2e-16 *** s25   1.0471     0.5623   1.862 0.062603. s26  -2.2683     1.3571  -1.671 0.094649. s27 -10.0463     2.6375  -3.809 0.000140 *** s28  -2.3992     0.8358  -2.871 0.004103 ** s29  19.5947     1.4171  13.828  < 2e-16 *** s30  47.0353     2.9064  16.183  < 2e-16 *** s31 -15.4312     1.2317 -12.528  < 2e-16 *** s34   7.3887     1.0701   6.905 5.22e-12 *** s36  10.9887     1.1125   9.878  < 2e-16 *** s38 -12.8826     1.6467  -7.824 5.46e-15 *** s39   9.1593     1.4341   6.387 1.74e-10 *** s40  18.2006     1.7630  10.323  < 2e-16 *** s44  -2.8801     1.1500  -2.505 0.012272 * s46  -2.1130     1.2342  -1.712 0.086924. s47   2.2720     1.1813   1.923 0.054461. s48  -3.2273     1.4964  -2.157 0.031039 * s49   7.8352     2.2002   3.561 0.000370 *** s50 -11.9132     1.4625  -8.146 4.04e-16 *** s80  28.2482     2.5847  10.929  < 2e-16 *** s81  -5.3919     1.7741  -3.039 0.002375 ** s84  16.1194     2.4100   6.689 2.33e-11 *** s91  14.8185     3.1938   4.640 3.52e-06 *** s92  -2.4609     1.7168  -1.433 0.151751 s101 -9.6603     3.3528  -2.881 0.003966 ** s108 -45.5597    2.1051 -21.643  < 2e-16 *** s111  3.5281     1.7623   2.002 0.045299 * s113 -8.3954     1.5665  -5.359 8.47e-08 *** s118 -12.3305    1.5313  -8.053 8.68e-16 *** s120 -7.6604     1.1297  -6.781 1.24e-11 *** s121  4.0442     1.3917   2.906 0.003665 ** s124 -6.6457     0.7902  -8.410  < 2e-16 *** s126  2.3230     1.0538   2.205 0.027503 * s128  3.9446     1.0501   3.757 0.000173 *** s129  3.6411     0.1770  20.575  < 2e-16 *** s130 -5.4580     1.5849  -3.444 0.000575 *** s131 -25.8821    2.5100 -10.311  < 2e-16 *** s146 -2.1609     1.0674  -2.024 0.042950 * s164 -5.3542     1.1379  -4.705 2.56e-06 *** s168 16.9584     1.1291  15.019  < 2e-16 *** s177  3.4397     1.5135   2.273 0.023065 * s184 -3.0319     1.0308  -2.941 0.003274 ** s185 -12.8910    1.3574  -9.497  < 2e-16 *** s222 10.6815     2.2331   4.783 1.74e-06 *** s228  7.2849     1.3844   5.262 1.44e-07 *** s229 14.5218     2.3799   6.102 1.07e-09 *** s280 12.2807     2.0545   5.978 2.31e-09 *** s285 -8.2762     2.5626  -3.230 0.001242 ** s289 -5.4504     2.4810  -2.197 0.028045 * --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 63.56 on 15952 degrees of freedom Multiple R-squared: 0.7978,    Adjusted R-squared:  0.797 F-statistic: 999.3 on 63 and 15952 DF, p-value: < 2.2e-16 [output]

code

Open Babel Implementation
Following the instructions on the OpenBabel plugin development page, we deployed the melting point model plugin by creating the appropriate text file and by updating the plugindefines.txt file in the data folder. We then used OpenBabel to calculate the melting points of the 3500 compound test set via the following command line:

code obabel -ismi ONSMP031TestSet.smi -O testsetout.txt --append OMP code

The resulting predictions matched the measured values with an R2: 0.61, AAE: 102.9, and RMSE: 58.2. The performance of the model is no better than the previous model previous model

Results
We successfully created an open fragment-based model that can easily be added to OpenBabel. We used R to create the linear model and SMF as the fragment calculator for an open data set of melting points. The test-set statistics R2: 0.61, AAE: 102.9, and RMSE: 58.2.