Recent Changes

Tuesday, April 25

  1. page Enthalpy of Solvation edited
    ...
    Variable Importance Plot
    Results
    We can see...
    {PCAColouredByAE.png}

    PCA Colored by AE
    References
    11:25 am
  2. 11:24 am
  3. page Enthalpy of Solvation edited
    ...
    [output]
    Call:
    ...
    = TRUE)
    Type of random forest: regression
    Number of trees: 500
    ...
    training.predict <- predict(mydata.rf,mydata)
    write.csv(training.predict, file = "AllDataRFPredict.csv")
    ...
    important descriptors (nHBDon, TopoPSA, khs.sOH, AMR) can be
    ...
    following plot:
    {VariableImportance.png}
    Variable Importance Plot
    Note: The correlation between the ALogP and XLogP descriptors is 0.83. If all descriptors with correlations >= 0.83 were removed, only 70 descriptors would remain instead of 87.
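    The correlation filter mentioned in the note can be sketched in base R. This is only an illustration under stated assumptions: the helper name drop_correlated is hypothetical, it keeps the first descriptor of each highly correlated pair and drops the later one (the page does not say which member of a pair would be removed), and the toy data frame stands in for the real descriptor table.

```r
# Hypothetical helper (not from the original page): greedily drop the later
# descriptor of every pair whose absolute Pearson correlation is >= cutoff.
drop_correlated <- function(descriptors, cutoff = 0.83) {
  cors <- abs(cor(descriptors))
  n <- ncol(descriptors)
  keep <- rep(TRUE, n)
  for (i in seq_len(n)) {
    if (!keep[i]) next
    for (j in seq_len(n)) {
      # Only look at later columns that are still kept
      if (j > i && keep[j] && isTRUE(cors[i, j] >= cutoff)) keep[j] <- FALSE
    }
  }
  descriptors[, keep, drop = FALSE]
}

# Toy data standing in for the descriptor table (assumption, not real data)
set.seed(1)
a <- rnorm(50)
toy <- data.frame(d1 = a, d2 = a + rnorm(50, sd = 0.1), d3 = rnorm(50))
filtered <- drop_correlated(toy, cutoff = 0.83)
print(colnames(filtered))
```

    Calling ncol() on the filtered table gives the number of descriptors that would remain at a given cutoff.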

    Results
    {20150728PCAWithAE.png} PCA Colored by AE
    The following results are an improvement on the model by Abraham and Acree, who report a Training-Set R2 of 0.83.
    RF Out of Bag (Training Set)
    Mean Square Error: 0.38
    R2: 0.63
    Test-Set
    Mean Square Error: 0.44
    R2: 0.54
    Full RF Out of Bag
    Mean Square Error: 0.34
    R2: 0.66
    Full RF Training-Set
    Mean Square Error: 0.06
    R2: 0.94
    Performance Issues
    We used DMax to find where the model does well and where it does not. The only significant factor is LogS itself. This may be due to the quality of data for low solubilities. When the MSE and R2 values are calculated for compounds both above and below 0.01 M, we see that for compounds with LogS <= -0.2 the MSE is 0.13 and for compounds with LogS > -0.2 the MSE is 0.05. By PC-distance there is no relationship, unlike what we saw previously for Abraham solvent coefficients:
    http://journal.chemistrycentral.com/content/9/1/12
    {20150728AEvsMeasured.png}
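    The above/below-threshold comparison of MSE described under Performance Issues could be computed along the following lines. This is a minimal sketch: the helper name mse_by_threshold and the toy vectors are assumptions made for illustration and are not taken from the page or its data.

```r
# Hypothetical helper: mean squared error computed separately for measured
# values at or below a threshold and for those above it.
mse_by_threshold <- function(measured, predicted, threshold) {
  sq_err <- (measured - predicted)^2
  low <- measured <= threshold
  c(at_or_below = mean(sq_err[low]), above = mean(sq_err[!low]))
}

# Toy values (assumption, not real data)
measured <- c(-1.0, -0.5, 0.0, 0.5)
predicted <- c(-0.6, -0.3, 0.1, 0.4)
result <- mse_by_threshold(measured, predicted, threshold = -0.2)
print(result)
```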

    References
    1. William E. Acree and Andrew SID Lang. Acree Enthalpy of Solvation Dataset. figshare. (2015). https://doi.org/10.6084/m9.figshare.1572326.v1
    11:24 am
  4. page Enthalpy of Solvation edited
    ...
    write.csv(test.predict, file = "RFTestSetPredict.csv")
    {TestSetPredictedVsMeasured.png}
    ...
    vs Measured DeltaHsolv for the
    ...
    were: AAE 4.76, MSE 39.7, R2 82.9% (We did not include azelaic acid in our calculations).
    Create RF model using all data.
    library("randomForest")
    mydata = read.csv(file="SolventInformationFeatureSelected.csv",head=TRUE,row.names="Key")
    mydata.rf <- randomForest(DeltaHsolv ~ .,
    ...
    mydata,importance = TRUE)
    print(mydata.rf)
    [output]
    Call:
    randomForest(formula = DeltaHsolv ~ .,
    ...
    importance = TRUE)
    Type of random forest: regression
    Number of trees: 500
    No. of variables
    ...
    each split: 28
    Mean of squared residuals: 65.72521
    % Var explained: 87.47
    [output]
    varImpPlot(mydata.rf,main="Random Forest Variable Importance")
    saveRDS(mydata.rf, file = "EnthalpyOfSolvation")
    training.predict <- predict(mydata.rf,mydata)
    write.csv(training.predict, file = "AllDataRFPredict.csv")
    The most important descriptors (ALogP, XLogP, TopoPSA, nAtomP, MDEC.23, khs.aaCH, nHBAcc) can be seen on the following plot:
    {20150728RFVARIMP.png} Variable Importance Plot
    11:10 am
  5. page Enthalpy of Solvation edited
    ...
    [output]
    Call:
    ...
    = TRUE)
    Type of random forest: regression
    Number of trees: 500
    ...
    test.predict <- predict(mydata.rf,test)
    write.csv(test.predict, file = "RFTestSetPredict.csv")
    {TestSetPredictedVsMeasured.png}
    Predicted vs Measured 1-octanol solubility for the test-set
    The test-set statistics were: AAE 0.53, MSE 0.44, R2 0.54
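    For reference, statistics of this kind can be computed from the measured and predicted values as sketched below. The helper name fit_stats and the toy vectors are hypothetical, and R2 is assumed here to be the coefficient of determination; the page does not state which definition of R2 was used.

```r
# Hypothetical helper: summary statistics for predicted vs measured values.
fit_stats <- function(measured, predicted) {
  residuals <- measured - predicted
  list(
    AAE = mean(abs(residuals)),   # average absolute error
    MSE = mean(residuals^2),      # mean squared error
    # Coefficient of determination: 1 - SS_res / SS_tot (assumed definition)
    R2 = 1 - sum(residuals^2) / sum((measured - mean(measured))^2)
  )
}

# Toy values (assumption, not real data)
measured <- c(1.0, 2.0, 3.0, 4.0)
predicted <- c(1.5, 2.0, 2.5, 4.0)
stats <- fit_stats(measured, predicted)
print(stats)
```

    With the files written above, the predicted column of RFTestSetPredict.csv and the measured values of the test set would take the place of the toy vectors.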
    10:53 am
  6. page Enthalpy of Solvation edited
    ...
    ##Modeling
    library("randomForest")
    mydata = read.csv(file="SolventInformationFeatureSelected.csv",head=TRUE,row.names="Key")
    ## 75% of the sample size
    smp_size <- floor(0.75 * nrow(mydata))
    ...
    [output]
    Call:
    randomForest(formula = DeltaHsolv ~ .,
    ...
    = TRUE)
    Type of random forest: regression
    Number of trees: 500
    No. of variables
    ...
    each split: 28
    Mean of squared residuals: 95.31999
    % Var explained: 82.69
    [output]
    saveRDS(mydata.rf, file = "EnthalpyTrainingSet")
    10:45 am
  7. page Enthalpy of Solvation edited
    ...
    [output]
    write.csv(x, file = "PCA.csv")
    ...
    in R.
    {PCAWithClusters.png}
    Chemical Space - Clustered Using R
    Modeling
    ##Modeling
    library("randomForest")
    mydata = read.csv(file="SolventInformationFeatureSelected.csv",head=TRUE,row.names="Key")
    ## 75% of the sample size
    smp_size <- floor(0.75 * nrow(mydata))
    ## set the
    ...
    your partition reproducible
    set.seed(123)
    train_ind <- sample(seq_len(nrow(mydata)), size = smp_size)
    train <- mydata[train_ind, ]
    test <- mydata[-train_ind, ]
    mydata.rf <- randomForest(DeltaHsolv ~ .,
    ...
    train,importance = TRUE)
    print(mydata.rf)
    [output]
    Call:
    randomForest(formula = LogS
    ...
    importance = TRUE)
    Type of random forest: regression
    Number of trees: 500
    No. of variables tried at each split: 28
    Mean of squared residuals: 0.3800402
    % Var explained: 62.66
    [output]
    saveRDS(mydata.rf, file = "EnthalpyTrainingSet")
    training.predict <- predict(mydata.rf,train)
    write.csv(training.predict, file = "RFTrainingSetPredict.csv")
    test.predict <- predict(mydata.rf,test)
    write.csv(test.predict, file = "RFTestSetPredict.csv")
    {20150728PredictedVSMeasuredTest.png}
    Predicted vs Measured 1-octanol solubility for the test-set
    10:39 am
  8. page Enthalpy of Solvation edited
    ...
    [output]
    write.csv(x, file = "PCA.csv")
    ...
    in R. Using DMax Chemistry Assistant, we see that the two groups are separated by nHBAcc and TopoPSA. The red cluster is typified by compounds with zero hydrogen bond acceptors, ALogP > 1.56, and a topological polar surface area (TopoPSA) below 26.48 square angstroms; 128 out of 157 compounds match these criteria. The blue cluster is more chemically diverse than the red cluster, but even so 75 of the 102 compounds have ALogP < 1.56, TopoPSA > 26.48, and at least one hydrogen bond acceptor.
    Overall the dataset consists of small (typically 100 < MW < 450) druglike compounds (50 compounds have one Lipinski failure) that are solid at room temperature, with 1-octanol solubilities typically between 0.01 M and 1 M. Looking at the nAcid descriptor, we see that 49 compounds are carboxylic acids.
    {ChemicalSpaceClusterAnalysis.png}

    {PCAWithClusters.png}

    Chemical Space - Clustered Using R
    Modeling
    10:29 am
