# Machine Learning Modelling to Predict the Efficacy of Cancer Treatment Drugs

>>List of Changeable Script Parameters
- input_dir: Directory path where input files are located and where output files will be written.
- models: List of regression models to be evaluated.
- num_sp: Number of splits for cross-validation.
- num_rep: Number of repetitions for cross-validation.

Other hyperparameters within the models and training functions can also be changed.

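To make the roles of num_sp and num_rep concrete, here is a minimal pure-Python sketch of repeated k-fold splitting. It assumes the two parameters drive a repeated k-fold scheme (for example scikit-learn's RepeatedKFold); the README does not say which splitter the script uses. The helper name repeated_kfold_indices and all values shown are illustrative assumptions, not the script's defaults.

```python
import random
from pathlib import Path

# Illustrative values only; the script's real defaults are not given in this README.
input_dir = Path("data")  # where inputs are read and outputs are written
num_sp = 5                # number of cross-validation splits (folds)
num_rep = 3               # number of cross-validation repetitions


def repeated_kfold_indices(n_samples, num_sp, num_rep, seed=0):
    """Yield (train_idx, test_idx) pairs for num_rep shuffled rounds of num_sp folds.

    Hypothetical helper: a stand-in for whatever splitter the script uses.
    """
    rng = random.Random(seed)
    for _ in range(num_rep):
        order = list(range(n_samples))
        rng.shuffle(order)  # a fresh shuffle for every repetition
        for fold in range(num_sp):
            test_idx = order[fold::num_sp]  # every num_sp-th sample forms one fold
            held_out = set(test_idx)
            train_idx = [i for i in order if i not in held_out]
            yield train_idx, test_idx


splits = list(repeated_kfold_indices(20, num_sp, num_rep))
print(len(splits))  # one train/test pair per fold per repetition
```

Each model is then fitted num_sp * num_rep times, so raising either value increases run time proportionally.
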
>>Input Files

Option 1: Read in the provided CSV file

The following raw data file can be read in by setting the input_dir variable to the directory containing the file.
- raw_data_erbb1_ic50.csv: CSV file containing data on EGFR protein inhibitors, including canonical SMILES and IC50 values.

Option 2: Fetch data from ChEMBL
Alternatively, by setting the fetch_chembl variable to TRUE, one can obtain the data directly from the ChEMBL database.

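The choice between the two options could look roughly like the sketch below. The function name load_raw_data is hypothetical, the script may well use pandas rather than the csv module, and the ChEMBL branch is deliberately left unimplemented because it needs network access.

```python
import csv
from pathlib import Path


def load_raw_data(input_dir, fetch_chembl=False,
                  filename="raw_data_erbb1_ic50.csv"):
    """Return the raw inhibitor records as a list of dicts, one per CSV row.

    Hypothetical helper; only Option 1 (local CSV) is sketched.
    """
    if fetch_chembl:
        # Option 2 needs network access to ChEMBL and is not sketched here.
        raise NotImplementedError("fetching from ChEMBL is not covered by this sketch")
    path = Path(input_dir) / filename
    if not path.is_file():
        raise FileNotFoundError(f"expected {filename} in {input_dir}")
    with path.open(newline="", encoding="utf-8") as handle:
        return list(csv.DictReader(handle))
```

Calling load_raw_data(input_dir) with the default fetch_chembl=False corresponds to Option 1.
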
>>Output Files Generated

The following files will be written directly to the input_dir that was set at the beginning of the script. Files highlighted in orange are intermediate outputs that are used in subsequent steps within the script.
- erbb1_bothassay_neglog10_ic50.csv: Processed dataset with transformed IC50 values.
- cb_pb_fingerprints.csv: Molecular fingerprints data.
- df_pb_cb_for_model_building.csv: Final dataset used for model training.
- evaluations_with_cv.csv: Evaluation metrics from cross-validation.
- test_results.csv: Final test results for the optimized model.
- final_feature_importance.csv: Feature importance from the optimized RandomForest model.

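After a run, a quick check like the following can confirm that every listed file was written. This is a convenience sketch, not part of the script; the helper name missing_outputs is hypothetical, and the file names are copied from the list above.

```python
from pathlib import Path

# File names copied from the output list in this README.
EXPECTED_OUTPUTS = [
    "erbb1_bothassay_neglog10_ic50.csv",
    "cb_pb_fingerprints.csv",
    "df_pb_cb_for_model_building.csv",
    "evaluations_with_cv.csv",
    "test_results.csv",
    "final_feature_importance.csv",
]


def missing_outputs(input_dir):
    """Return the expected output files that are not present in input_dir."""
    folder = Path(input_dir)
    return [name for name in EXPECTED_OUTPUTS if not (folder / name).is_file()]
```

An empty list from missing_outputs(input_dir) means all six files are present.
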
>>List of Custom Functions
1. logm: Converts IC50 values from nM to -log10(M).
2. mol_descriptors: Generates molecular descriptors from SMILES.
3. morgan_fpts: Generates Morgan fingerprints from SMILES.
4. train_evaluate_model_with_cv: Trains each model in the list of models and performs cross-validation.
5. plot_learning_curve: Plots the learning curve for each model in the list of models.

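The unit conversion behind logm can be sketched as follows. Since 1 nM equals 1e-9 M, the negative base-10 logarithm of the molar concentration is 9 minus log10 of the value in nM. This scalar version is an assumption for illustration; the script's own logm may differ, for example by operating on a whole data column.

```python
import math


def logm(ic50_nm):
    """Convert an IC50 in nM to -log10 of the molar concentration (pIC50)."""
    if ic50_nm <= 0:
        raise ValueError("IC50 must be positive")
    # 1 nM = 1e-9 M, so -log10(ic50_nm * 1e-9) = 9 - log10(ic50_nm).
    return 9.0 - math.log10(ic50_nm)


print(logm(1000.0))  # 1000 nM is 1 micromolar
```

On this scale a more potent inhibitor (lower IC50) gets a higher value: 1000 nM corresponds to about 6, and 1 nM to about 9.
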
>>Supplementary File (Drawing fingerprint bits of interest)
- Open DrawFingerprints.ipynb (requires for_fingerprint_visualization.csv to be in the same folder).
- Run the notebook to see visualizations of bits 343 and 1366.