MeltingPointModel010

=Using QsarDB to create a melting point model web service=

Researchers
Andrew SID Lang and Villu Ruusmann


**[This page is a duplicate (backup) of the original model page on the ONSChallenge wiki]**

Objective
To create a relatively small (compared with our other melting point models) Random Forest-based melting point model and distribute it as a web service through the QsarDB open digital repository.

Background
Our goal is to create an Open (CC0) melting point model from Open Data and Open Descriptors (CDK), following a transparent, reproducible, open procedure. One recent solution for deploying models of all types in the open is the QsarDB open digital repository, developed by Villu Ruusmann. A previous analysis, MPModel009, produced a successful model, but the resulting QDB archive was too large to deploy. Here we develop a model on a smaller (but highly curated) dataset and perform feature selection to reduce the number of descriptors, with the goal of a good model of reasonable size: under 10 MB, the current limit of the QsarDB repository.

Procedure
**Data Collection and Curation.** We began with the doubleplusgood melting point dataset ([[file:20110727doublevalidated.xlsx|ONSMP029]]) of 2706 highly curated, double+ validated (range: 0.1-5 °C) unique compounds that have no chiral centers and do not exhibit cis/trans isomerism. From this set we removed coronene and octaphenylcyclotetrasiloxane, as they are obvious outliers of the chemical space. For the remaining 2704 compounds, we generated all CDK descriptors except CPSA, IP, WHIM, all protein descriptors, and all geometrical descriptors. We then removed HybRatio and Kier3 because of multiple NA entries, and all khs.xxx descriptors with fewer than 27 (1%) non-zero values, leaving 161 descriptors.

**Feature Selection.** While Random Forest models have no problems with highly correlated variables, such variables can skew variable importance measures. We therefore used the caret package for R to remove highly correlated descriptors (BCUTc-1h, apol, naAromAtom, nAromBond, nAtom, ATSc2, ATSc3, ATSm2, ATSp1, ATSp2, ATSp3, ATSp4, ATSp5, nB, C1SP1, SCH-3, VCH-4, VC-5, SP-0, SP-1, SP-2, SP-3, SP-4, SP-5, SP-6, VP-0, VP-1, VP-2, VP-3, VP-4, VP-5, VP-7, SPC-5, SPC-6, VPC-5, ECCEN, Kier1, Kier2, VABC, WTPT-1, WPATH, WPOL, Zagreb), found using the following code:

code
library("caret")

# load in data
mydata = read.csv(file="20120607DoubleValidatedReadyForFeatureSelection.csv",header=TRUE,row.names="molID")

# correlation matrix
cor.mat = cor(mydata)

# find descriptors with absolute correlation > 0.90
findCorrelation(cor.mat, cutoff = .90, verbose = TRUE)

# [output] column indices of the descriptors to remove
# 7 12 13 14 15 17 18 22 26 27 28 29 30 32 34 43 49 59 61 62 63 64 65 66 67 69 70 71 72 73 74 76 78 79 81 83 121 122 151 153 158 159 161
code

This leaves 2704 compounds with 118 descriptors ready for modeling.
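The two filters described above (dropping sparse descriptors, then dropping one member of each highly correlated pair) can be sketched in base R on toy data. This is an illustration under stated assumptions, not the script used for the model: the toy table, the helper name `drop_correlated`, and the greedy "drop the later column" rule are all hypothetical. caret's `findCorrelation` chooses which variable to drop by mean absolute correlation, so its selections can differ from this sketch.

```r
# Toy descriptor table (hypothetical values, not the ONS dataset)
set.seed(1)
n <- 200
signal <- rnorm(n)
toy <- data.frame(
  a = signal,
  b = signal + rnorm(n, sd = 0.05),  # nearly a copy of column a
  c = rnorm(n),                      # independent of a and b
  sparse = c(1, rep(0, n - 1))       # a single non-zero value
)

# Filter 1: keep only columns with at least 1% non-zero values
nonzero_fraction <- colSums(toy != 0) / nrow(toy)
kept <- toy[, nonzero_fraction >= 0.01, drop = FALSE]

# Filter 2 (simplified stand-in for caret::findCorrelation):
# drop any column whose |r| with an earlier column exceeds the cutoff
drop_correlated <- function(df, cutoff = 0.90) {
  r <- abs(cor(df))
  r[upper.tri(r, diag = TRUE)] <- 0  # keep only pairs (later column, earlier column)
  to_drop <- rownames(r)[apply(r, 1, function(row) any(row > cutoff))]
  df[, !(names(df) %in% to_drop), drop = FALSE]
}

filtered <- drop_correlated(kept)
print(names(filtered))
```

With the real data, the indices returned by `findCorrelation` would be used in the same way, for example `mydata[, -idx]`, where `idx` is a hypothetical name for the returned vector.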

**Modeling.** A random forest was created using the following code:

code
library("randomForest")
mydata = read.csv(file="20120607DoubleValidatedReadyRF.csv",header=TRUE,row.names="molID")

# do random forest [randomForest 4.5-34]
mydata.rf <- randomForest(mpC ~ ., data = mydata,importance = TRUE)
print(mydata.rf)

# [output]
# Call: randomForest(formula = mpC ~ ., data = mydata, importance = TRUE)
# Type of random forest: regression
# Number of trees: 500
# No. of variables tried at each split: 39
# Mean of squared residuals: 1451.971
# % Var explained: 83.34

# get variable importance plot
varImpPlot(mydata.rf,main="Random Forest Variable Importance")
code

The RF reports an OOB R2 of 0.83 and an OOB RMSE of 38.1 °C. The resulting variable importance plot points to both the number of hydrogen bond donors (nHBDon) and the topological polar surface area (TopoPSA) as the most important physicochemical properties for melting point prediction, as found in all previous analyses.
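The two summary statistics follow directly from the printed model output: the RMSE is the square root of the "Mean of squared residuals", and R2 is the "% Var explained" divided by 100. A small worked check in R (the variable names are illustrative only):

```r
# Values copied from the printed randomForest output
mse <- 1451.971      # "Mean of squared residuals" (out-of-bag)
pct_var <- 83.34     # "% Var explained" (out-of-bag)

rmse <- sqrt(mse)    # root mean squared error, in degrees C
r2 <- pct_var / 100  # pseudo R-squared as a fraction

cat(sprintf("OOB RMSE = %.1f C, OOB R2 = %.2f\n", rmse, r2))
```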

The model was then saved so that it could be deployed as a web service using the following code:

code
saveRDS(mydata.rf, file = "ONSMPModel010")
code

The model is available for batch melting point prediction under a CC0 license. It was then used to predict the melting points of the training set in order to identify possible errors in the dataset and compounds that are difficult to model using current 2D CDK descriptors (such as coronene).

code
training.predict <- predict(mydata.rf,mydata)
write.csv(training.predict, file = "RFTrainingSetPredict.csv")
code

Plotting the predicted versus measured melting point values using Tableau Public, we see that melting point tends to increase with larger TopoPSA (colour) and nHBDon (size). The top outliers are: cyanic iodide, 2,6-dimethoxy-p-benzoquinone, 2-methyl-4-nitro-1H-imidazole, isophthalic acid, 2,2,3,3-tetramethylbutane, 2-(1,3-thiazol-4-yl)-1H-benzimidazole, 2-mercaptobenzimidazole, p-quaterphenyl, 4,4'-dihydroxybiphenyl, 2,4-hexadiyne.
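Outliers of this kind can also be ranked numerically by absolute residual. The following is a minimal sketch on made-up numbers; the compound names, values, and column names are hypothetical and are not taken from the ONS dataset.

```r
# Hypothetical measured and predicted melting points (degrees C)
results <- data.frame(
  name = c("compound A", "compound B", "compound C", "compound D"),
  measured = c(25.0, 180.0, -40.0, 310.0),
  predicted = c(30.5, 120.0, -38.0, 250.5),
  stringsAsFactors = FALSE
)

# Residual = measured minus predicted; rank by its absolute size
results$residual <- results$measured - results$predicted
ranked <- results[order(abs(results$residual), decreasing = TRUE), ]

# Show the two compounds the model fits worst
print(head(ranked, 2))
```

Note that `predict(mydata.rf, mydata)` returns in-sample predictions, which fit the training data more closely than out-of-bag estimates; calling `predict(mydata.rf)` without new data returns the out-of-bag predictions, which may be a stricter basis for flagging outliers.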

**QDB Format Archive.** A parallel QDB format archive was created using the same data and the following commands in a CMD window:

code
cd C:\alang\share\MyMesh\ONSC\qsardb
java -Xms512M -Xmx1024M -cp conversion-toolkit-r595.jar org.qsardb.conversion.SpreadsheetConverter --id D --smiles B --name A --properties C --source C:/alang/share/MyMesh/ONSC/qsardb/originaldata/meltingpoints/ONSMP010.csv --target C:/alang/share/MyMesh/ONSC/qsardb/ONSMP010
[prompt]
Id (default: 'column_C'): mpC
Name (default: 'Column C'): mpC
Value format pattern (default: as-is): 0.#
[prompt]
java -cp prediction-toolkit-r595.jar org.qsardb.prediction.DescriptorRegistryManager --dir C:/alang/share/MyMesh/ONSC/qsardb/ONSMP010 add-cdk
java -Xms512M -Xmx1024M -cp prediction-toolkit-r595.jar org.qsardb.prediction.DescriptorCalculator --dir C:/alang/share/MyMesh/ONSC/qsardb/ONSMP010
java -Xms512M -Xmx1024M -cp prediction-toolkit-r595.jar org.qsardb.prediction.DescriptorRegistryManager --dir C:/alang/share/MyMesh/ONSC/qsardb/ONSMP010 purge
java -cp prediction-toolkit-r595.jar org.qsardb.prediction.DescriptorRegistryManager --dir C:/alang/share/MyMesh/ONSC/qsardb/ONSMP010 remove --id BCUTc-1h
java -cp prediction-toolkit-r595.jar org.qsardb.prediction.DescriptorRegistryManager --dir C:/alang/share/MyMesh/ONSC/qsardb/ONSMP010 remove --id apol
java -cp prediction-toolkit-r595.jar org.qsardb.prediction.DescriptorRegistryManager --dir C:/alang/share/MyMesh/ONSC/qsardb/ONSMP010 remove --id naAromAtom
java -cp prediction-toolkit-r595.jar org.qsardb.prediction.DescriptorRegistryManager --dir C:/alang/share/MyMesh/ONSC/qsardb/ONSMP010 remove --id nAromBond
java -cp prediction-toolkit-r595.jar org.qsardb.prediction.DescriptorRegistryManager --dir C:/alang/share/MyMesh/ONSC/qsardb/ONSMP010 remove --id nAtom
java -cp prediction-toolkit-r595.jar org.qsardb.prediction.DescriptorRegistryManager --dir C:/alang/share/MyMesh/ONSC/qsardb/ONSMP010 remove --id ATSc2
java -cp prediction-toolkit-r595.jar org.qsardb.prediction.DescriptorRegistryManager --dir C:/alang/share/MyMesh/ONSC/qsardb/ONSMP010 remove --id ATSc3
java -cp prediction-toolkit-r595.jar org.qsardb.prediction.DescriptorRegistryManager --dir C:/alang/share/MyMesh/ONSC/qsardb/ONSMP010 remove --id ATSm2
java -cp prediction-toolkit-r595.jar org.qsardb.prediction.DescriptorRegistryManager --dir C:/alang/share/MyMesh/ONSC/qsardb/ONSMP010 remove --id ATSp1
java -cp prediction-toolkit-r595.jar org.qsardb.prediction.DescriptorRegistryManager --dir C:/alang/share/MyMesh/ONSC/qsardb/ONSMP010 remove --id ATSp2
java -cp prediction-toolkit-r595.jar org.qsardb.prediction.DescriptorRegistryManager --dir C:/alang/share/MyMesh/ONSC/qsardb/ONSMP010 remove --id ATSp3
java -cp prediction-toolkit-r595.jar org.qsardb.prediction.DescriptorRegistryManager --dir C:/alang/share/MyMesh/ONSC/qsardb/ONSMP010 remove --id ATSp4
java -cp prediction-toolkit-r595.jar org.qsardb.prediction.DescriptorRegistryManager --dir C:/alang/share/MyMesh/ONSC/qsardb/ONSMP010 remove --id ATSp5
java -cp prediction-toolkit-r595.jar org.qsardb.prediction.DescriptorRegistryManager --dir C:/alang/share/MyMesh/ONSC/qsardb/ONSMP010 remove --id nB
java -cp prediction-toolkit-r595.jar org.qsardb.prediction.DescriptorRegistryManager --dir C:/alang/share/MyMesh/ONSC/qsardb/ONSMP010 remove --id C1SP1
java -cp prediction-toolkit-r595.jar org.qsardb.prediction.DescriptorRegistryManager --dir C:/alang/share/MyMesh/ONSC/qsardb/ONSMP010 remove --id SCH-3
java -cp prediction-toolkit-r595.jar org.qsardb.prediction.DescriptorRegistryManager --dir C:/alang/share/MyMesh/ONSC/qsardb/ONSMP010 remove --id VCH-4
java -cp prediction-toolkit-r595.jar org.qsardb.prediction.DescriptorRegistryManager --dir C:/alang/share/MyMesh/ONSC/qsardb/ONSMP010 remove --id VC-5
java -cp prediction-toolkit-r595.jar org.qsardb.prediction.DescriptorRegistryManager --dir C:/alang/share/MyMesh/ONSC/qsardb/ONSMP010 remove --id SP-0
java -cp prediction-toolkit-r595.jar org.qsardb.prediction.DescriptorRegistryManager --dir C:/alang/share/MyMesh/ONSC/qsardb/ONSMP010 remove --id SP-1
java -cp prediction-toolkit-r595.jar org.qsardb.prediction.DescriptorRegistryManager --dir C:/alang/share/MyMesh/ONSC/qsardb/ONSMP010 remove --id SP-2
java -cp prediction-toolkit-r595.jar org.qsardb.prediction.DescriptorRegistryManager --dir C:/alang/share/MyMesh/ONSC/qsardb/ONSMP010 remove --id SP-3
java -cp prediction-toolkit-r595.jar org.qsardb.prediction.DescriptorRegistryManager --dir C:/alang/share/MyMesh/ONSC/qsardb/ONSMP010 remove --id SP-4
java -cp prediction-toolkit-r595.jar org.qsardb.prediction.DescriptorRegistryManager --dir C:/alang/share/MyMesh/ONSC/qsardb/ONSMP010 remove --id SP-5
java -cp prediction-toolkit-r595.jar org.qsardb.prediction.DescriptorRegistryManager --dir C:/alang/share/MyMesh/ONSC/qsardb/ONSMP010 remove --id SP-6
java -cp prediction-toolkit-r595.jar org.qsardb.prediction.DescriptorRegistryManager --dir C:/alang/share/MyMesh/ONSC/qsardb/ONSMP010 remove --id VP-0
java -cp prediction-toolkit-r595.jar org.qsardb.prediction.DescriptorRegistryManager --dir C:/alang/share/MyMesh/ONSC/qsardb/ONSMP010 remove --id VP-1
java -cp prediction-toolkit-r595.jar org.qsardb.prediction.DescriptorRegistryManager --dir C:/alang/share/MyMesh/ONSC/qsardb/ONSMP010 remove --id VP-2
java -cp prediction-toolkit-r595.jar org.qsardb.prediction.DescriptorRegistryManager --dir C:/alang/share/MyMesh/ONSC/qsardb/ONSMP010 remove --id VP-3
java -cp prediction-toolkit-r595.jar org.qsardb.prediction.DescriptorRegistryManager --dir C:/alang/share/MyMesh/ONSC/qsardb/ONSMP010 remove --id VP-4
java -cp prediction-toolkit-r595.jar org.qsardb.prediction.DescriptorRegistryManager --dir C:/alang/share/MyMesh/ONSC/qsardb/ONSMP010 remove --id VP-5
java -cp prediction-toolkit-r595.jar org.qsardb.prediction.DescriptorRegistryManager --dir C:/alang/share/MyMesh/ONSC/qsardb/ONSMP010 remove --id VP-7
java -cp prediction-toolkit-r595.jar org.qsardb.prediction.DescriptorRegistryManager --dir C:/alang/share/MyMesh/ONSC/qsardb/ONSMP010 remove --id SPC-5
java -cp prediction-toolkit-r595.jar org.qsardb.prediction.DescriptorRegistryManager --dir C:/alang/share/MyMesh/ONSC/qsardb/ONSMP010 remove --id SPC-6
java -cp prediction-toolkit-r595.jar org.qsardb.prediction.DescriptorRegistryManager --dir C:/alang/share/MyMesh/ONSC/qsardb/ONSMP010 remove --id VPC-5
java -cp prediction-toolkit-r595.jar org.qsardb.prediction.DescriptorRegistryManager --dir C:/alang/share/MyMesh/ONSC/qsardb/ONSMP010 remove --id ECCEN
java -cp prediction-toolkit-r595.jar org.qsardb.prediction.DescriptorRegistryManager --dir C:/alang/share/MyMesh/ONSC/qsardb/ONSMP010 remove --id HybRatio
java -cp prediction-toolkit-r595.jar org.qsardb.prediction.DescriptorRegistryManager --dir C:/alang/share/MyMesh/ONSC/qsardb/ONSMP010 remove --id khs.sLi
java -cp prediction-toolkit-r595.jar org.qsardb.prediction.DescriptorRegistryManager --dir C:/alang/share/MyMesh/ONSC/qsardb/ONSMP010 remove --id khs.ssBe
java -cp prediction-toolkit-r595.jar org.qsardb.prediction.DescriptorRegistryManager --dir C:/alang/share/MyMesh/ONSC/qsardb/ONSMP010 remove --id khs.ssssBe
java -cp prediction-toolkit-r595.jar org.qsardb.prediction.DescriptorRegistryManager --dir C:/alang/share/MyMesh/ONSC/qsardb/ONSMP010 remove --id khs.ssBH
java -cp prediction-toolkit-r595.jar org.qsardb.prediction.DescriptorRegistryManager --dir C:/alang/share/MyMesh/ONSC/qsardb/ONSMP010 remove --id khs.sssB
java -cp prediction-toolkit-r595.jar org.qsardb.prediction.DescriptorRegistryManager --dir C:/alang/share/MyMesh/ONSC/qsardb/ONSMP010 remove --id khs.ssssB
java -cp prediction-toolkit-r595.jar org.qsardb.prediction.DescriptorRegistryManager --dir C:/alang/share/MyMesh/ONSC/qsardb/ONSMP010 remove --id khs.tCH
java -cp prediction-toolkit-r595.jar org.qsardb.prediction.DescriptorRegistryManager --dir C:/alang/share/MyMesh/ONSC/qsardb/ONSMP010 remove --id khs.ddC
java -cp prediction-toolkit-r595.jar org.qsardb.prediction.DescriptorRegistryManager --dir C:/alang/share/MyMesh/ONSC/qsardb/ONSMP010 remove --id khs.sNH3
java -cp prediction-toolkit-r595.jar org.qsardb.prediction.DescriptorRegistryManager --dir C:/alang/share/MyMesh/ONSC/qsardb/ONSMP010 remove --id khs.ssNH2
java -cp prediction-toolkit-r595.jar org.qsardb.prediction.DescriptorRegistryManager --dir C:/alang/share/MyMesh/ONSC/qsardb/ONSMP010 remove --id khs.dNH
java -cp prediction-toolkit-r595.jar org.qsardb.prediction.DescriptorRegistryManager --dir C:/alang/share/MyMesh/ONSC/qsardb/ONSMP010 remove --id khs.sssNH
java -cp prediction-toolkit-r595.jar org.qsardb.prediction.DescriptorRegistryManager --dir C:/alang/share/MyMesh/ONSC/qsardb/ONSMP010 remove --id khs.ssssN
java -cp prediction-toolkit-r595.jar org.qsardb.prediction.DescriptorRegistryManager --dir C:/alang/share/MyMesh/ONSC/qsardb/ONSMP010 remove --id khs.sSiH3
java -cp prediction-toolkit-r595.jar org.qsardb.prediction.DescriptorRegistryManager --dir C:/alang/share/MyMesh/ONSC/qsardb/ONSMP010 remove --id khs.ssSiH2
java -cp prediction-toolkit-r595.jar org.qsardb.prediction.DescriptorRegistryManager --dir C:/alang/share/MyMesh/ONSC/qsardb/ONSMP010 remove --id khs.sssSiH
java -cp prediction-toolkit-r595.jar org.qsardb.prediction.DescriptorRegistryManager --dir C:/alang/share/MyMesh/ONSC/qsardb/ONSMP010 remove --id khs.sPH2
java -cp prediction-toolkit-r595.jar org.qsardb.prediction.DescriptorRegistryManager --dir C:/alang/share/MyMesh/ONSC/qsardb/ONSMP010 remove --id khs.ssPH
java -cp prediction-toolkit-r595.jar org.qsardb.prediction.DescriptorRegistryManager --dir C:/alang/share/MyMesh/ONSC/qsardb/ONSMP010 remove --id khs.sssP
java -cp prediction-toolkit-r595.jar org.qsardb.prediction.DescriptorRegistryManager --dir C:/alang/share/MyMesh/ONSC/qsardb/ONSMP010 remove --id khs.dsssP
java -cp prediction-toolkit-r595.jar org.qsardb.prediction.DescriptorRegistryManager --dir C:/alang/share/MyMesh/ONSC/qsardb/ONSMP010 remove --id khs.sssssP
java -cp prediction-toolkit-r595.jar org.qsardb.prediction.DescriptorRegistryManager --dir C:/alang/share/MyMesh/ONSC/qsardb/ONSMP010 remove --id khs.dssS
java -cp prediction-toolkit-r595.jar org.qsardb.prediction.DescriptorRegistryManager --dir C:/alang/share/MyMesh/ONSC/qsardb/ONSMP010 remove --id khs.sGeH3
java -cp prediction-toolkit-r595.jar org.qsardb.prediction.DescriptorRegistryManager --dir C:/alang/share/MyMesh/ONSC/qsardb/ONSMP010 remove --id khs.ssGeH2
java -cp prediction-toolkit-r595.jar org.qsardb.prediction.DescriptorRegistryManager --dir C:/alang/share/MyMesh/ONSC/qsardb/ONSMP010 remove --id khs.sssGeH
java -cp prediction-toolkit-r595.jar org.qsardb.prediction.DescriptorRegistryManager --dir C:/alang/share/MyMesh/ONSC/qsardb/ONSMP010 remove --id khs.ssssGe
java -cp prediction-toolkit-r595.jar org.qsardb.prediction.DescriptorRegistryManager --dir C:/alang/share/MyMesh/ONSC/qsardb/ONSMP010 remove --id khs.sAsH2
java -cp prediction-toolkit-r595.jar org.qsardb.prediction.DescriptorRegistryManager --dir C:/alang/share/MyMesh/ONSC/qsardb/ONSMP010 remove --id khs.ssAsH
java -cp prediction-toolkit-r595.jar org.qsardb.prediction.DescriptorRegistryManager --dir C:/alang/share/MyMesh/ONSC/qsardb/ONSMP010 remove --id khs.sssAs
java -cp prediction-toolkit-r595.jar org.qsardb.prediction.DescriptorRegistryManager --dir C:/alang/share/MyMesh/ONSC/qsardb/ONSMP010 remove --id khs.sssdAs
java -cp prediction-toolkit-r595.jar org.qsardb.prediction.DescriptorRegistryManager --dir C:/alang/share/MyMesh/ONSC/qsardb/ONSMP010 remove --id khs.sssssAs
java -cp prediction-toolkit-r595.jar org.qsardb.prediction.DescriptorRegistryManager --dir C:/alang/share/MyMesh/ONSC/qsardb/ONSMP010 remove --id khs.sSeH
java -cp prediction-toolkit-r595.jar org.qsardb.prediction.DescriptorRegistryManager --dir C:/alang/share/MyMesh/ONSC/qsardb/ONSMP010 remove --id khs.dSe
java -cp prediction-toolkit-r595.jar org.qsardb.prediction.DescriptorRegistryManager --dir C:/alang/share/MyMesh/ONSC/qsardb/ONSMP010 remove --id khs.ssSe
java -cp prediction-toolkit-r595.jar org.qsardb.prediction.DescriptorRegistryManager --dir C:/alang/share/MyMesh/ONSC/qsardb/ONSMP010 remove --id khs.aaSe
java -cp prediction-toolkit-r595.jar org.qsardb.prediction.DescriptorRegistryManager --dir C:/alang/share/MyMesh/ONSC/qsardb/ONSMP010 remove --id khs.dssSe
java -cp prediction-toolkit-r595.jar org.qsardb.prediction.DescriptorRegistryManager --dir C:/alang/share/MyMesh/ONSC/qsardb/ONSMP010 remove --id khs.ddssSe
java -cp prediction-toolkit-r595.jar org.qsardb.prediction.DescriptorRegistryManager --dir C:/alang/share/MyMesh/ONSC/qsardb/ONSMP010 remove --id khs.sSnH3
java -cp prediction-toolkit-r595.jar org.qsardb.prediction.DescriptorRegistryManager --dir C:/alang/share/MyMesh/ONSC/qsardb/ONSMP010 remove --id khs.ssSnH2
java -cp prediction-toolkit-r595.jar org.qsardb.prediction.DescriptorRegistryManager --dir C:/alang/share/MyMesh/ONSC/qsardb/ONSMP010 remove --id khs.sssSnH
java -cp prediction-toolkit-r595.jar org.qsardb.prediction.DescriptorRegistryManager --dir C:/alang/share/MyMesh/ONSC/qsardb/ONSMP010 remove --id khs.ssssSn
java -cp prediction-toolkit-r595.jar org.qsardb.prediction.DescriptorRegistryManager --dir C:/alang/share/MyMesh/ONSC/qsardb/ONSMP010 remove --id khs.sPbH3
java -cp prediction-toolkit-r595.jar org.qsardb.prediction.DescriptorRegistryManager --dir C:/alang/share/MyMesh/ONSC/qsardb/ONSMP010 remove --id khs.ssPbH2
java -cp prediction-toolkit-r595.jar org.qsardb.prediction.DescriptorRegistryManager --dir C:/alang/share/MyMesh/ONSC/qsardb/ONSMP010 remove --id khs.sssPbH
java -cp prediction-toolkit-r595.jar org.qsardb.prediction.DescriptorRegistryManager --dir C:/alang/share/MyMesh/ONSC/qsardb/ONSMP010 remove --id khs.ssssPb
java -cp prediction-toolkit-r595.jar org.qsardb.prediction.DescriptorRegistryManager --dir C:/alang/share/MyMesh/ONSC/qsardb/ONSMP010 remove --id Kier1
java -cp prediction-toolkit-r595.jar org.qsardb.prediction.DescriptorRegistryManager --dir C:/alang/share/MyMesh/ONSC/qsardb/ONSMP010 remove --id Kier2
java -cp prediction-toolkit-r595.jar org.qsardb.prediction.DescriptorRegistryManager --dir C:/alang/share/MyMesh/ONSC/qsardb/ONSMP010 remove --id Kier3
java -cp prediction-toolkit-r595.jar org.qsardb.prediction.DescriptorRegistryManager --dir C:/alang/share/MyMesh/ONSC/qsardb/ONSMP010 remove --id VABC
java -cp prediction-toolkit-r595.jar org.qsardb.prediction.DescriptorRegistryManager --dir C:/alang/share/MyMesh/ONSC/qsardb/ONSMP010 remove --id WTPT-1
java -cp prediction-toolkit-r595.jar org.qsardb.prediction.DescriptorRegistryManager --dir C:/alang/share/MyMesh/ONSC/qsardb/ONSMP010 remove --id WPATH
java -cp prediction-toolkit-r595.jar org.qsardb.prediction.DescriptorRegistryManager --dir C:/alang/share/MyMesh/ONSC/qsardb/ONSMP010 remove --id WPOL
java -cp prediction-toolkit-r595.jar org.qsardb.prediction.DescriptorRegistryManager --dir C:/alang/share/MyMesh/ONSC/qsardb/ONSMP010 remove --id Zagreb
java -cp prediction-toolkit-r595.jar org.qsardb.prediction.ModelRegistryManager --dir C:/alang/share/MyMesh/ONSC/qsardb/ONSMP010 add --id rf --name "Random forest regression" --property-id mpC
java -cp prediction-toolkit-r595.jar org.qsardb.prediction.ModelRegistryManager --dir C:/alang/share/MyMesh/ONSC/qsardb/ONSMP010 attach-rds --id rf
java -cp prediction-toolkit-r595.jar org.qsardb.prediction.PredictionRegistryManager --dir C:/alang/share/MyMesh/ONSC/qsardb/ONSMP010 add --id rf-training --name "Random forest regression (training)" --model-id rf
java -cp prediction-toolkit-r595.jar org.qsardb.prediction.PredictionRegistryManager --dir C:/alang/share/MyMesh/ONSC/qsardb/ONSMP010 attach-values --id rf-training
code
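The long run of `remove` commands above differs only in the descriptor ID, so the command lines could be generated rather than typed. A minimal sketch in R, assuming the jar name, directory, and flags exactly as they appear in the commands above; the variable names and the three-ID subset are for illustration only.

```r
# Values copied from the commands above
jar <- "prediction-toolkit-r595.jar"
qdb_dir <- "C:/alang/share/MyMesh/ONSC/qsardb/ONSMP010"

# A short subset of the descriptor IDs to remove, for illustration
ids <- c("BCUTc-1h", "apol", "naAromAtom")

# Build one command line per descriptor ID (sprintf is vectorised over ids)
cmds <- sprintf(
  "java -cp %s org.qsardb.prediction.DescriptorRegistryManager --dir %s remove --id %s",
  jar, qdb_dir, ids
)

writeLines(cmds)
```

The generated lines could then be pasted into the CMD window or written to a batch file with `writeLines(cmds, con)` for a file path `con` of your choosing.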

The Random Forest for the QDB archive was then created in R using:

code
suppressMessages(library("randomForest"))

qdbDir = "C:/alang/share/MyMesh/ONSC/qsardb/ONSMP010"

propertyId = 'mpC'
descriptorIdList = c('ALogP', 'ALogp2', 'AMR', 'BCUTw-1l', 'BCUTw-1h', 'BCUTc-1l', 'BCUTp-1l', 'BCUTp-1h',
    'fragC', 'nAcid', 'ATSc1', 'ATSc4', 'ATSc5', 'ATSm1', 'ATSm3', 'ATSm4', 'ATSm5', 'nBase', 'bpol',
    'C2SP1', 'C1SP2', 'C2SP2', 'C3SP2', 'C1SP3', 'C2SP3', 'C3SP3', 'C4SP3',
    'SCH-4', 'SCH-5', 'SCH-6', 'SCH-7', 'VCH-3', 'VCH-5', 'VCH-6', 'VCH-7',
    'SC-3', 'SC-4', 'SC-5', 'SC-6', 'VC-3', 'VC-4', 'VC-6', 'SP-7', 'VP-6', 'SPC-4', 'VPC-4', 'VPC-6',
    'FMF', 'nHBDon', 'nHBAcc',
    'khs.sCH3', 'khs.dCH2', 'khs.ssCH2', 'khs.dsCH', 'khs.aaCH', 'khs.sssCH', 'khs.tsC', 'khs.dssC',
    'khs.aasC', 'khs.aaaC', 'khs.ssssC', 'khs.sNH2', 'khs.ssNH', 'khs.aaNH', 'khs.tN', 'khs.dsN',
    'khs.aaN', 'khs.sssN', 'khs.ddsN', 'khs.aasN', 'khs.sOH', 'khs.dO', 'khs.ssO', 'khs.aaO', 'khs.sF',
    'khs.ssssSi', 'khs.sSH', 'khs.dS', 'khs.ssS', 'khs.aaS', 'khs.ddssS', 'khs.sCl', 'khs.sBr', 'khs.sI',
    'nAtomLC', 'nAtomP', 'LipinskiFailures', 'nAtomLAC', 'MLogP',
    'MDEC-11', 'MDEC-12', 'MDEC-13', 'MDEC-14', 'MDEC-22', 'MDEC-23', 'MDEC-24', 'MDEC-33', 'MDEC-34', 'MDEC-44',
    'MDEO-11', 'MDEO-12', 'MDEO-22', 'MDEN-11', 'MDEN-12', 'MDEN-13', 'MDEN-22', 'MDEN-23', 'MDEN-33',
    'PetitjeanNumber', 'nRotB', 'TopoPSA', 'VAdjMat', 'MW', 'WTPT-2', 'WTPT-3', 'WTPT-4', 'WTPT-5', 'XLogP')

# read one tab-separated QDB values file and drop rows with missing values;
# hyphens in the ID are replaced so the column name is usable in a formula
loadValues = function(path, id){
    result = read.table(path, header = TRUE, sep = "\t", na.strings = "N/A")
    result = na.omit(result)
    names(result) = c('Id', gsub("-", "_", x = id))
    return (result)
}

loadPropertyValues = function(id){
    return (loadValues(paste(sep = "/", qdbDir, "properties", id, "values"), id))
}

loadDescriptorValues = function(id){
    return (loadValues(paste(sep = "/", qdbDir, "descriptors", id, "values"), id))
}

# assemble the training table: property values joined with each descriptor by compound Id
rfdata = loadPropertyValues(propertyId)
for(descriptorId in descriptorIdList){
    print (descriptorId)
    rfdata = merge(rfdata, loadDescriptorValues(descriptorId), by = 'Id')
}

compoundIds = rfdata$Id
rfdata$Id = NULL

rfmodel = randomForest(formula = mpC ~ ., data = rfdata)
print(rfmodel)

# wrap the model in a plain list with accessor and evaluation functions
object = list()
object$propertyId = propertyId
object$getPropertyId = function(self){
    return (self$propertyId)
}
object$descriptorIdList = descriptorIdList
object$getDescriptorIdList = function(self){
    return (self$descriptorIdList)
}
object$rfmodel = rfmodel
object$evaluate = function(self, values){
    suppressMessages(require("randomForest"))
    descriptorIdList = self$getDescriptorIdList(self)
    descriptorIdList = sapply(descriptorIdList, function(x) gsub("-", "_", x))
    newrfdata = data.frame(c = NA)
    for(i in 1:length(descriptorIdList)){
        newrfdata[descriptorIdList[i]] = values[i]
    }
    return (predict(self$rfmodel, newdata = newrfdata))
}

saveRDS(file = paste(sep = "/", qdbDir, "models/rf/rds"), object)

# write training-set predictions into the archive
rfvalues = predict(rfmodel, rfdata)
predictedValues = data.frame(compoundIds, rfvalues)
write.table(predictedValues, file = paste(sep = "/", qdbDir, "predictions/rf-training/values"), col.names = c("csid", "mpC"), row.names = FALSE, quote = FALSE, sep = "\t")
code
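The wrapper pattern used above (a plain list carrying the fitted model, the descriptor IDs, and an `evaluate` function that takes the object itself as its first argument) can be exercised in isolation. The sketch below swaps in a base R linear model on toy data as a stand-in for the random forest, so it runs without extra packages; the toy data, the descriptor ID `x-1`, and the name `wrapper` are hypothetical.

```r
# Stand-in model: y = 1 + 2 * x on toy data (the archive itself uses randomForest)
toy <- data.frame(y = c(1, 3, 5, 7, 9), x_1 = c(0, 1, 2, 3, 4))

wrapper <- list()
wrapper$descriptorIdList <- c("x-1")            # ID as stored, with a hyphen
wrapper$model <- lm(y ~ x_1, data = toy)
wrapper$evaluate <- function(self, values) {
  # Map hyphenated IDs to the column names used when fitting
  ids <- gsub("-", "_", self$descriptorIdList)
  newdata <- as.data.frame(setNames(as.list(values), ids))
  unname(predict(self$model, newdata = newdata))
}

# Round trip through an RDS file, as the deployment step does
path <- tempfile(fileext = ".rds")
saveRDS(wrapper, file = path)
restored <- readRDS(path)

pred <- restored$evaluate(restored, c(5))
print(pred)
```

Passing the object explicitly as `self` keeps the serialized list self-describing: whatever loads the RDS file only needs to call `evaluate` with the descriptor values in the order given by the stored ID list.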

The QDB archive was then zipped and tested (the SMILES shown is phenol) using:

code
java -cp prediction-toolkit-r595.jar org.qsardb.prediction.SMILESPredictor --archive ONSMP010.qdb.zip --format "0.0" --smiles "c1ccc(cc1)O"
code

The QDB archive was then deployed as a web service on the QsarDB Open Digital Repository.

Results
An accurate (OOB R2 = 0.83, OOB RMSE = 38.1 °C) melting point model using Open descriptors and Open data was developed and deployed on the QsarDB Open Digital Repository, where it can be used as a web service.