This script takes four arguments:
- gen
: A list of SMILES strings representing the de novo generated molecules. Molecules should be found under a column named "SMILES".
- ref1
: A list of SMILES strings representing the reference molecules for novelty calculation. (e.g. ChEMBL molecules)
- ref2
(optional): A list of SMILES strings representing the reference molecules for novelty calculation. (e.g. selected inhibitors)
- output (optional, default: results)
: The output file where the computed metrics will be saved.
The following is a generic example of how to use the evaluation script:
python evaluate.py --gen "[SMILES FILE]" --ref1 "[TRAINING SET FILE]" --ref2 "[TEST SET FILE]" --output "[PERFORMANCE RESULTS FILE]"
To evaluate the AKT1 targeted generated molecules used in the paper, run:
python evaluate.py --gen "generated_molecules/DrugGEN_generated_molecules_AKT1.csv" --ref1 "../data/chembl_train.smi" --ref2 "../data/akt_train.smi" --output "results_akt1"
To evaluate the CDK2 targeted generated molecules used in the paper, run:
python evaluate.py --gen "generated_molecules/DrugGEN_generated_molecules_CDK2.csv" --ref1 "../data/chembl_train.smi" --ref2 "../data/cdk2_train.smi" --output "results_cdk2.csv"
The script calculates the following metrics:
- Validity: The fraction of valid molecules in the generated set.
- Uniqueness: The fraction of unique molecules in the generated set.
- Novelty: The fraction of molecules in the generated set that are not present in the reference sets.
- Internal Diversity: The average Tanimoto similarity between all pairs of molecules in the generated set.
- QED: The average QED score of the molecules in the generated set.
- SA: The average SA score of the molecules in the generated set.
- FCD: The average FCD score of the molecules in the generated set against both reference sets.
- Fragment Similarity: The average fragment similarity score of the molecules in the generated set against both reference sets.
- Scaffold Similarity: The average scaffold similarity score of the molecules in the generated set against both reference sets.
- Lipinski: The fraction of molecules in the generated set that pass the Lipinski filter.
- Veber: The fraction of molecules in the generated set that pass the Veber filter.
- PAINS: The fraction of molecules in the generated set that pass the PAINS filter.