More modeling of melting points using Open Babel

Researcher: Andrew SID Lang

Introduction

The goal of this study is to create an open (CC0) fragment-based melting point model that can be deployed into Open Babel. This model will replace our previous model, which used a limited set of fragments, that has performance statistics of R2: 0.61, AAE: 43.6, and RMSE: 58.1 on the 3500 compound test-set.

Procedure

To make models easily comparable, we use here the same melting point dataset (ONSMP031.csv) as used in our previous models (MPModel007, MeltingPointModel011), keeping the exact same training-set and test-set as before too.

We converted the SMILES from the same training set (ONSMP031TrainingSet.csv) into an sdf file which we then ran through SMF v3.03 to calculate fragments, both atom/bond sequences (lengths: 2-5). This gave us a training-set of 16,015 compounds with 2568 sequence descriptors.

We removed strongly correlated descriptors by using the following code in R (uses the caret package)
## set the working directory
setwd("C:/MeltingPointModel013")
 
## load the caret package
library(caret)
 
## load in data
mydata = read.csv(file="TrainingSetWithSequenceFragments.csv",head=TRUE,row.names="molID")
 
## get the number of columns
ncol(mydata)
[output]
2569
[output]
 
## remove small variance columns
nzv <-nearZeroVar(mydata)
mydata <- mydata[, -nzv]
 
ncol(mydata)
[output]
101
[output]
 
## create the correlation matrix
cmatrix <- cor(mydata);
 
## remove highly correlated descriptors
highc <- findCorrelation(cmatrix, 0.90, verbose=T);
mydata <- mydata[, -highc];
 
ncol(mydata)
[output]
74
[output]
write.csv(mydata,file="DescriptorsFeatureSelected.csv")
 

Here's the resulting file (DescriptorsFeatureSelected.csv). Again using R, we developed a simple linear model and performed further feature selection to attain the fragment coefficients.

## load in data
mydata = read.csv(file="DescriptorsFeatureSelected.csv", head=TRUE,row.names="molID")
 
fit <- lm(mpC ~ 0 + ., data=mydata)
summary(fit)
 
[output]
Residual standard error: 63.57 on 15942 degrees of freedom
Multiple R-squared:  0.7979,    Adjusted R-squared:  0.797
F-statistic: 862.2 on 73 and 15942 DF,  p-value: < 2.2e-16
[output]
 
## step-wise feature selection to remove non-significant features
library(MASS)
bestfit <- stepAIC(fit, upper=~., lower=~1, direction="both")
 
## display results
bestfit$anova
 
[output]
Stepwise Model Path
Analysis of Deviance Table
 
Initial Model:
mpC ~ 0 + (s3 + s4 + s7 + s10 + s11 + s12 + s13 + s16 + s17 +
    s18 + s20 + s21 + s22 + s23 + s24 + s25 + s26 + s27 + s28 +
    s29 + s30 + s31 + s34 + s36 + s37 + s38 + s39 + s40 + s41 +
    s43 + s44 + s46 + s47 + s48 + s49 + s50 + s79 + s80 + s81 +
    s84 + s91 + s92 + s95 + s101 + s108 + s111 + s113 + s115 +
    s118 + s120 + s121 + s124 + s126 + s128 + s129 + s130 + s131 +
    s133 + s146 + s156 + s164 + s168 + s177 + s184 + s185 + s221 +
    s222 + s223 + s228 + s229 + s280 + s285 + s289)
 
Final Model:
mpC ~ s3 + s4 + s7 + s10 + s11 + s12 + s13 + s16 + s17 + s18 +
    s20 + s21 + s22 + s23 + s24 + s25 + s26 + s27 + s28 + s29 +
    s30 + s31 + s34 + s36 + s38 + s39 + s40 + s44 + s46 + s47 +
    s48 + s49 + s50 + s80 + s81 + s84 + s91 + s92 + s101 + s108 +
    s111 + s113 + s118 + s120 + s121 + s124 + s126 + s128 + s129 +
    s130 + s131 + s146 + s164 + s168 + s177 + s184 + s185 + s222 +
    s228 + s229 + s280 + s285 + s289 - 1
 
 
     Step Df   Deviance Resid. Df Resid. Dev      AIC
1                           15942   64424489 133066.1
2  - s133  1   17.75295     15943   64424507 133064.1
3   - s41  1   20.95081     15944   64424528 133062.1
4  - s223  1   47.27437     15945   64424575 133060.1
5   - s43  1  396.60900     15946   64424972 133058.2
6  - s221  1  447.07669     15947   64425419 133056.3
7  - s156  1  454.47326     15948   64425873 133054.4
8   - s37  1 1362.72231     15949   64427236 133052.8
9   - s95  1 1297.73838     15950   64428534 133051.1
10 - s115  1 4276.68111     15951   64432810 133050.1
11  - s79  1 7519.50646     15952   64440330 133050.0
[output]
 
summary(bestfit)  ## output the final model
 
[output]
Call:
lm(formula = mpC ~ s3 + s4 + s7 + s10 + s11 + s12 + s13 + s16 +
    s17 + s18 + s20 + s21 + s22 + s23 + s24 + s25 + s26 + s27 +
    s28 + s29 + s30 + s31 + s34 + s36 + s38 + s39 + s40 + s44 +
    s46 + s47 + s48 + s49 + s50 + s80 + s81 + s84 + s91 + s92 +
    s101 + s108 + s111 + s113 + s118 + s120 + s121 + s124 + s126 +
    s128 + s129 + s130 + s131 + s146 + s164 + s168 + s177 + s184 +
    s185 + s222 + s228 + s229 + s280 + s285 + s289 - 1, data = mydata)
 
Residuals:
    Min      1Q  Median      3Q     Max
-526.98  -37.40   -1.28   37.25  376.06
 
Coefficients:
     Estimate Std. Error t value Pr(>|t|)
s3    -1.8498     0.2132  -8.675  < 2e-16 ***
s4   -25.2846     4.1205  -6.136 8.64e-10 ***
s7     6.2586     0.6926   9.036  < 2e-16 ***
s10   26.7114     2.3063  11.582  < 2e-16 ***
s11    2.8083     0.6881   4.081 4.50e-05 ***
s12   51.9087     2.5122  20.662  < 2e-16 ***
s13    8.4622     0.6209  13.628  < 2e-16 ***
s16    3.7174     0.9865   3.768 0.000165 ***
s17    7.8643     1.9408   4.052 5.10e-05 ***
s18    5.8715     0.7096   8.274  < 2e-16 ***
s20   16.7968     1.4447  11.627  < 2e-16 ***
s21    0.6915     0.4553   1.519 0.128837
s22   -1.4671     0.3901  -3.761 0.000170 ***
s23   -2.9295     1.2455  -2.352 0.018676 *
s24   -7.0190     0.6391 -10.983  < 2e-16 ***
s25    1.0471     0.5623   1.862 0.062603 .
s26   -2.2683     1.3571  -1.671 0.094649 .
s27  -10.0463     2.6375  -3.809 0.000140 ***
s28   -2.3992     0.8358  -2.871 0.004103 **
s29   19.5947     1.4171  13.828  < 2e-16 ***
s30   47.0353     2.9064  16.183  < 2e-16 ***
s31  -15.4312     1.2317 -12.528  < 2e-16 ***
s34    7.3887     1.0701   6.905 5.22e-12 ***
s36   10.9887     1.1125   9.878  < 2e-16 ***
s38  -12.8826     1.6467  -7.824 5.46e-15 ***
s39    9.1593     1.4341   6.387 1.74e-10 ***
s40   18.2006     1.7630  10.323  < 2e-16 ***
s44   -2.8801     1.1500  -2.505 0.012272 *
s46   -2.1130     1.2342  -1.712 0.086924 .
s47    2.2720     1.1813   1.923 0.054461 .
s48   -3.2273     1.4964  -2.157 0.031039 *
s49    7.8352     2.2002   3.561 0.000370 ***
s50  -11.9132     1.4625  -8.146 4.04e-16 ***
s80   28.2482     2.5847  10.929  < 2e-16 ***
s81   -5.3919     1.7741  -3.039 0.002375 **
s84   16.1194     2.4100   6.689 2.33e-11 ***
s91   14.8185     3.1938   4.640 3.52e-06 ***
s92   -2.4609     1.7168  -1.433 0.151751
s101  -9.6603     3.3528  -2.881 0.003966 **
s108 -45.5597     2.1051 -21.643  < 2e-16 ***
s111   3.5281     1.7623   2.002 0.045299 *
s113  -8.3954     1.5665  -5.359 8.47e-08 ***
s118 -12.3305     1.5313  -8.053 8.68e-16 ***
s120  -7.6604     1.1297  -6.781 1.24e-11 ***
s121   4.0442     1.3917   2.906 0.003665 **
s124  -6.6457     0.7902  -8.410  < 2e-16 ***
s126   2.3230     1.0538   2.205 0.027503 *
s128   3.9446     1.0501   3.757 0.000173 ***
s129   3.6411     0.1770  20.575  < 2e-16 ***
s130  -5.4580     1.5849  -3.444 0.000575 ***
s131 -25.8821     2.5100 -10.311  < 2e-16 ***
s146  -2.1609     1.0674  -2.024 0.042950 *
s164  -5.3542     1.1379  -4.705 2.56e-06 ***
s168  16.9584     1.1291  15.019  < 2e-16 ***
s177   3.4397     1.5135   2.273 0.023065 *
s184  -3.0319     1.0308  -2.941 0.003274 **
s185 -12.8910     1.3574  -9.497  < 2e-16 ***
s222  10.6815     2.2331   4.783 1.74e-06 ***
s228   7.2849     1.3844   5.262 1.44e-07 ***
s229  14.5218     2.3799   6.102 1.07e-09 ***
s280  12.2807     2.0545   5.978 2.31e-09 ***
s285  -8.2762     2.5626  -3.230 0.001242 **
s289  -5.4504     2.4810  -2.197 0.028045 *
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
 
Residual standard error: 63.56 on 15952 degrees of freedom
Multiple R-squared:  0.7978,    Adjusted R-squared:  0.797
F-statistic: 999.3 on 63 and 15952 DF,  p-value: < 2.2e-16
[output]
 

Open Babel Implementation


Following the instructions on the OpenBabel plugin development page, we deployed the melting point model plugin by creating the appropriate text file (20180112mpC.txt) and by updating the plugindefines.txt file in the data folder. We then used OpenBabel to calculate the melting points of the 3500 compound test set via the following command line:

obabel -ismi ONSMP031TestSet.smi -O testsetout.txt --append OMP

The resulting predictions matched the measured values with an R2: 0.61, AAE: 102.9, and RMSE: 58.2. The performance of the model is no better than the previous model previous model

Results


We successfully created an open fragment-based model that can easily be added to OpenBabel. We used R to create the linear model and SMF as the fragment calculator for an open data set of melting points. The test-set statistics R2: 0.61, AAE: 102.9, and RMSE: 58.2.