# DiffSBDD: Structure-based Drug Design with Equivariant Diffusion Models

Official implementation of **DiffSBDD**, an equivariant diffusion model for structure-based drug design, by Arne Schneuing, Charles Harris, Yuanqi Du, Kieran Didi, Arian Jamasb, Ilia Igashov, Weitao Du, Carla Gomes, Tom Blundell, Pietro Lio, Max Welling, Michael Bronstein & Bruno Correia.

[Paper (Nature Computational Science)](https://doi.org/10.1038/s43588-024-00737-x)
[arXiv preprint](http://arxiv.org/abs/2210.13695)
[Open in Colab](https://colab.research.google.com/github/arneschneuing/DiffSBDD/blob/main/colab/DiffSBDD.ipynb)

> [!TIP]
> You can also try out our new 3D generative models for drug design at https://github.com/LPDI-EPFL/DrugFlow.

1. [Dependencies](#dependencies)
   1. [Conda environment](#conda-environment)
   2. [Pre-trained models](#pre-trained-models)
2. [Step-by-step examples](#step-by-step-examples)
   1. [De novo design](#de-novo-design)
   2. [Substructure inpainting](#substructure-inpainting)
   3. [Molecular optimization](#molecular-optimization)
3. [Benchmarks](#benchmarks)
   1. [CrossDocked Benchmark](#crossdocked)
   2. [Binding MOAD](#binding-moad)
   3. [Sampled molecules](#sampled-molecules)
4. [Training](#training)
5. [Inference](#inference)
   1. [Sample molecules for a given pocket](#sample-molecules-for-a-given-pocket)
   2. [Test set sampling](#sample-molecules-for-all-pockets-in-the-test-set)
   3. [Fix substructures](#fix-substructures)
   4. [Metrics](#metrics)
6. [Citation](#citation)

## Dependencies

### Conda environment
```bash
conda create -n sbdd-env
conda activate sbdd-env
conda install pytorch cudatoolkit=10.2 -c pytorch
conda install -c conda-forge pytorch-lightning
conda install -c conda-forge wandb
conda install -c conda-forge rdkit
conda install -c conda-forge biopython
conda install -c conda-forge imageio
conda install -c anaconda scipy
conda install -c pyg pytorch-scatter
conda install -c conda-forge openbabel
conda install seaborn
```

The code was tested with the following versions:

| Software          | Version   |
|-------------------|-----------|
| Python            | 3.10.4    |
| CUDA              | 10.2.89   |
| PyTorch           | 1.12.1    |
| PyTorch Lightning | 1.7.4     |
| WandB             | 0.13.1    |
| RDKit             | 2022.03.2 |
| BioPython         | 1.79      |
| imageio           | 2.21.2    |
| SciPy             | 1.7.3     |
| PyTorch Scatter   | 2.0.9     |
| OpenBabel         | 3.1.1     |

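To see how an existing environment compares with the tested versions, a small check along the following lines may help. It is only a sketch: the distribution names on the left of `TESTED` are assumptions about how each package registers itself (conda builds sometimes expose different names or version formats, or no metadata at all), and a differing version does not necessarily mean the code will fail.

```python
from importlib import metadata

# Tested versions copied from the table above. The keys are ASSUMED
# distribution names; adjust them if your installation uses different ones.
TESTED = {
    "torch": "1.12.1",
    "pytorch-lightning": "1.7.4",
    "wandb": "0.13.1",
    "rdkit": "2022.03.2",
    "biopython": "1.79",
    "imageio": "2.21.2",
    "scipy": "1.7.3",
    "torch-scatter": "2.0.9",
}


def installed_version(dist_name):
    """Return the installed version string, or None if no metadata is found."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


def compare_versions(tested=TESTED):
    """Map each distribution name to a (tested, installed) pair."""
    return {name: (want, installed_version(name)) for name, want in tested.items()}


if __name__ == "__main__":
    for name, (want, have) in compare_versions().items():
        status = "ok" if have == want else "differs"
        print(f"{name:20s} tested={want:10s} installed={have} ({status})")
```

Python, CUDA and OpenBabel are left out because they are not ordinary Python distributions and need to be checked separately.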
### Pre-trained models
Pre-trained models can be downloaded from [Zenodo](https://zenodo.org/record/8183747).
- [CrossDocked, conditional $`C_\alpha`$ model](https://zenodo.org/record/8183747/files/crossdocked_ca_cond.ckpt?download=1)
- [CrossDocked, joint $`C_\alpha`$ model](https://zenodo.org/record/8183747/files/crossdocked_ca_joint.ckpt?download=1)
- [CrossDocked, conditional full-atom model](https://zenodo.org/record/8183747/files/crossdocked_fullatom_cond.ckpt?download=1)
- [CrossDocked, joint full-atom model](https://zenodo.org/record/8183747/files/crossdocked_fullatom_joint.ckpt?download=1)
- [Binding MOAD, conditional $`C_\alpha`$ model](https://zenodo.org/record/8183747/files/moad_ca_cond.ckpt?download=1)
- [Binding MOAD, joint $`C_\alpha`$ model](https://zenodo.org/record/8183747/files/moad_ca_joint.ckpt?download=1)
- [Binding MOAD, conditional full-atom model](https://zenodo.org/record/8183747/files/moad_fullatom_cond.ckpt?download=1)
- [Binding MOAD, joint full-atom model](https://zenodo.org/record/8183747/files/moad_fullatom_joint.ckpt?download=1)

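The eight checkpoint links above follow one naming pattern, `<dataset>_<pocket>_<mode>.ckpt`. The helper below is a sketch that rebuilds those URLs and optionally downloads a file; the function names and the `checkpoints` default folder are illustrative choices, not part of the repository.

```python
import urllib.request
from pathlib import Path

ZENODO_FILES = "https://zenodo.org/record/8183747/files"

# Values taken from the file names in the list above.
DATASETS = ("crossdocked", "moad")
POCKETS = ("ca", "fullatom")
MODES = ("cond", "joint")


def checkpoint_name(dataset, pocket, mode):
    """Build a checkpoint file name such as 'crossdocked_fullatom_cond.ckpt'."""
    if dataset not in DATASETS or pocket not in POCKETS or mode not in MODES:
        raise ValueError(f"unknown combination: {dataset}, {pocket}, {mode}")
    return f"{dataset}_{pocket}_{mode}.ckpt"


def checkpoint_url(dataset, pocket, mode):
    """Return the Zenodo download URL for one of the listed checkpoints."""
    return f"{ZENODO_FILES}/{checkpoint_name(dataset, pocket, mode)}?download=1"


def download_checkpoint(dataset, pocket, mode, out_dir="checkpoints"):
    """Download a checkpoint unless it is already present; return its path."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    target = out_dir / checkpoint_name(dataset, pocket, mode)
    if not target.exists():
        urllib.request.urlretrieve(checkpoint_url(dataset, pocket, mode), str(target))
    return target
```

For example, `download_checkpoint("crossdocked", "fullatom", "cond")` would fetch the checkpoint used in the step-by-step examples.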
## Step-by-step examples

These simple step-by-step examples provide an easy entry point to generating molecules with DiffSBDD.
More details about training and sampling scripts are provided below.

Before we run the sampling scripts, we need to download a model checkpoint:
```bash
wget -P checkpoints/ https://zenodo.org/record/8183747/files/crossdocked_fullatom_cond.ckpt
```
It will be stored in the `./checkpoints` folder.

### De novo design

Using the trained model weights, we can sample new ligands with a single command. In this example, we use the protein with PDB ID `3RFM` that can be found in the example folder.
The PDB file contains a reference ligand in chain A at residue number 330 that we can use to specify the designated binding pocket.
The following command will generate 20 samples and save them in a file called `3rfm_mol.sdf` in the `./example` folder.
```bash
python generate_ligands.py checkpoints/crossdocked_fullatom_cond.ckpt --pdbfile example/3rfm.pdb --outfile example/3rfm_mol.sdf --ref_ligand A:330 --n_samples 20
```
Instead of specifying the chain and residue number we can also provide an SDF file with the reference ligand:
```bash
python generate_ligands.py checkpoints/crossdocked_fullatom_cond.ckpt --pdbfile example/3rfm.pdb --outfile example/3rfm_mol.sdf --ref_ligand example/3rfm_B_CFF.sdf --n_samples 20
```
If no reference ligand is known, the binding pocket can also be specified as a list of residues as described [below](#sample-molecules-for-a-given-pocket).

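After a run it can be useful to check how many molecules actually ended up in the output file, since post-processing such as the `--sanitize` flag removes invalid molecules. The sketch below counts records using only the SDF convention that each record ends with a `$$$$` line. It does not parse chemistry; for that, RDKit's `Chem.SDMolSupplier` is the more robust choice. The function names are illustrative.

```python
def split_sdf_records(text):
    """Split SDF text into per-molecule records; each record ends with '$$$$'."""
    records, current = [], []
    for line in text.splitlines():
        if line.strip() == "$$$$":
            records.append("\n".join(current))
            current = []
        else:
            current.append(line)
    return records


def count_sdf_records(path):
    """Count the molecule records stored in an SDF file."""
    with open(path) as handle:
        return len(split_sdf_records(handle.read()))


if __name__ == "__main__":
    # Path taken from the example command above; it only exists after a run.
    print(count_sdf_records("example/3rfm_mol.sdf"))
```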
### Substructure inpainting

To design molecules around fixed substructures (scaffold elaboration, fragment linking etc.) you can run the `inpaint.py` script.
Here, we demonstrate its usage with a fragment linking example. Similar to `generate_ligands.py`, the inpainting script allows us to define pockets based on a reference ligand in SDF format or with a chain and residue identifier (if it is in the PDB).
The easiest way to fix substructures is to provide them in a separate SDF file using the `--fix_atoms` flag.
However, the script also accepts a list of atom names which must correspond to the atoms of the reference ligand in the PDB file, e.g. `--fix_atoms C1 N6 C5 C12`.
```bash
python inpaint.py checkpoints/crossdocked_fullatom_cond.ckpt --pdbfile example/5ndu.pdb --outfile example/5ndu_linked_mols.sdf --ref_ligand example/5ndu_C_8V2.sdf --fix_atoms example/fragments.sdf --center ligand --add_n_nodes 10
```
Note that the `--center ligand` option tells DiffSBDD to sample the additional atoms near the center of mass of the fixed substructure, which is not always ideal or desired.
For instance, the inputs could be two fragments with very different sizes, in which case the random noise will be sampled very close to the larger fragment.
We currently also support sampling in the pocket center (`--center pocket`) but in some cases neither of these two options might be suitable and a problem-specific solution is warranted to avoid bad results.

Another important parameter is `--add_n_nodes` which determines how many new atoms will be added. If it is not provided, a random number will be sampled.

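The effect of fragment size on the sampling center can be illustrated with a toy calculation. This sketch assumes the center is the unweighted mean of the fixed atoms' coordinates (the actual weighting used by the code is not stated here), and the coordinates are made up. The midpoint between fragment centers is shown only as one conceivable problem-specific alternative, not as an option the scripts provide.

```python
def centroid(points):
    """Unweighted mean of a list of (x, y, z) coordinates."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(3))


# Hypothetical fragments: nine atoms spread along x from 0 to 8, and one atom at x = 20.
large_fragment = [(float(x), 0.0, 0.0) for x in range(9)]
small_fragment = [(20.0, 0.0, 0.0)]

# Center over all fixed atoms: dominated by the larger fragment.
combined_center = centroid(large_fragment + small_fragment)

# Midpoint between the two fragment centers: sits in the gap to be linked.
midpoint = centroid([centroid(large_fragment), centroid(small_fragment)])

print(combined_center)  # x is about 5.6, still inside the large fragment
print(midpoint)         # x is 12.0, between the two fragments
```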
### Molecular optimization

You can use DiffSBDD to optimize existing molecules for given properties via the `optimize.py` script.

```bash
python optimize.py --checkpoint checkpoints/crossdocked_fullatom_cond.ckpt --pdbfile example/5ndu.pdb --outfile output.sdf --ref_ligand example/5ndu_C_8V2.sdf --objective sa --population_size 100 --evolution_steps 10 --top_k 10 --timesteps 100
```

Important parameters in the evolutionary algorithm are:
- `--checkpoint`: The checkpoint to use for the noising-denoising model.
- `--objective`: The optimization objective. Currently supports 'qed' for Quantitative Estimate of Drug-likeness and 'sa' for Synthetic Accessibility. Custom objectives can be implemented within the code.
- `--population_size`: The size of the molecule population to maintain across the optimization generations.
- `--evolution_steps`: The number of evolutionary steps (generations) to perform during the optimization process.
- `--top_k`: The number of top-scoring molecules to select from one generation to the next.
- `--timesteps`: The number of noise-denoise steps to use in the optimization algorithm. Defaults to 100 (out of T=500).

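How the population parameters interact can be sketched with a toy selection loop. This is not the code in `optimize.py`: `mutate` is a hypothetical stand-in for the partial noising and denoising of a molecule, `score` stands in for the objective, candidates are plain numbers, and the sketch assumes that higher scores are better and that parents compete with their children.

```python
import random


def evolve(seed, mutate, score, population_size=100, evolution_steps=10,
           top_k=10, rng=None):
    """Toy evolutionary loop: mutate, rank by score, keep the top_k survivors."""
    rng = rng or random.Random(0)
    population = [seed]
    for _ in range(evolution_steps):
        # Create a new generation by perturbing randomly chosen survivors.
        children = [mutate(rng.choice(population), rng)
                    for _ in range(population_size)]
        # Parents stay in the pool (an assumption), so the best score never drops.
        ranked = sorted(population + children, key=score, reverse=True)
        population = ranked[:top_k]
    return population


def toy_mutate(x, rng):
    """Placeholder for noise-denoise: add a small random perturbation."""
    return x + rng.gauss(0.0, 0.5)


def toy_score(x):
    """Placeholder objective: closeness to an arbitrary target value of 3."""
    return -abs(x - 3.0)


survivors = evolve(0.0, toy_mutate, toy_score)
print(len(survivors), toy_score(survivors[0]))
```

In this picture, a larger `--timesteps` value corresponds to a stronger perturbation per generation, and a smaller `--top_k` to stronger selection pressure.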
## Benchmarks
### CrossDocked

#### Data preparation
Download and extract the dataset as described by the authors of Pocket2Mol: https://github.com/pengxingang/Pocket2Mol/tree/main/data

Process the raw data using:
```bash
python process_crossdock.py <crossdocked_dir> --no_H
```

### Binding MOAD
#### Data preparation
Download the dataset:
```bash
wget http://www.bindingmoad.org/files/biou/every_part_a.zip
wget http://www.bindingmoad.org/files/biou/every_part_b.zip
wget http://www.bindingmoad.org/files/csv/every.csv

unzip every_part_a.zip
unzip every_part_b.zip
```
Process the raw data using:
```bash
python -W ignore process_bindingmoad.py <bindingmoad_dir>
```
Add the `--ca_only` flag to create a dataset with $C_\alpha$ pocket representation.

### Sampled molecules
Sampled molecules can be found on [Zenodo](https://zenodo.org/record/8239058).

## Training
Starting a new training run:
```bash
python -u train.py --config <config>.yml
```

Resuming a previous run:
```bash
python -u train.py --config <config>.yml --resume <checkpoint>.ckpt
```

## Inference

### Sample molecules for a given pocket
To sample small molecules for a given pocket with a trained model use the following command:
```bash
python generate_ligands.py <checkpoint>.ckpt --pdbfile <pdb_file>.pdb --outfile <output_file> --resi_list <list_of_pocket_residue_ids>
```
For example:
```bash
python generate_ligands.py last.ckpt --pdbfile 1abc.pdb --outfile results/1abc_mols.sdf --resi_list A:1 A:2 A:3 A:4 A:5 A:6 A:7
```
Alternatively, the binding pocket can also be specified based on a reference ligand in the same PDB file:
```bash
python generate_ligands.py <checkpoint>.ckpt --pdbfile <pdb_file>.pdb --outfile <output_file> --ref_ligand <chain>:<resi>
```
or with a separate SDF file:
```bash
python generate_ligands.py <checkpoint>.ckpt --pdbfile <pdb_file>.pdb --outfile <output_file> --ref_ligand <ref_ligand>.sdf
```

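When many pockets need to be processed, the command line can be assembled programmatically. The helper below is a sketch that only uses flags shown in this README; the function names are hypothetical, and treating `--resi_list` and `--ref_ligand` as mutually exclusive is an assumption of the sketch rather than documented behaviour of the script.

```python
import shlex


def residue_ids(chain, start, stop):
    """Format residues start..stop (inclusive) of one chain as '<chain>:<resi>'."""
    return [f"{chain}:{resi}" for resi in range(start, stop + 1)]


def generate_ligands_cmd(checkpoint, pdbfile, outfile, resi_list=None,
                         ref_ligand=None, n_samples=None, sanitize=False):
    """Assemble the argument list for one generate_ligands.py call."""
    # Assumption of this sketch: exactly one way of defining the pocket is given.
    if (resi_list is None) == (ref_ligand is None):
        raise ValueError("give exactly one of resi_list or ref_ligand")
    cmd = ["python", "generate_ligands.py", checkpoint,
           "--pdbfile", pdbfile, "--outfile", outfile]
    if resi_list is not None:
        cmd += ["--resi_list", *resi_list]
    else:
        cmd += ["--ref_ligand", ref_ligand]
    if n_samples is not None:
        cmd += ["--n_samples", str(n_samples)]
    if sanitize:
        cmd.append("--sanitize")
    return cmd


cmd = generate_ligands_cmd("last.ckpt", "1abc.pdb", "results/1abc_mols.sdf",
                           resi_list=residue_ids("A", 1, 7))
print(shlex.join(cmd))
```

The printed line reproduces the `--resi_list` example above; passing `cmd` to `subprocess.run(cmd, check=True)` from the repository root would execute it.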
Optional flags:
| Flag | Description |
|------|-------------|
| `--n_samples` | Number of sampled molecules |
| `--num_nodes_lig` | Size of sampled molecules |
| `--timesteps` | Number of denoising steps for inference |
| `--all_frags` | Keep all disconnected fragments |
| `--sanitize` | Sanitize molecules (invalid molecules will be removed if this flag is present) |
| `--relax` | Relax generated structure in force field (does not consider the protein and might introduce clashes) |
| `--resamplings` | Inpainting parameter (doesn't apply if conditional model is used) |
| `--jump_length` | Inpainting parameter (doesn't apply if conditional model is used) |

### Sample molecules for all pockets in the test set
`test.py` can be used to sample molecules for the entire testing set:
```bash
python test.py <checkpoint>.ckpt --test_dir <bindingmoad_dir>/processed_noH/test/ --outdir <output_dir> --sanitize
```
There are different ways to determine the size of sampled molecules.
- `--fix_n_nodes`: generates ligands with the same number of nodes as the reference molecule
- `--n_nodes_bias <int>`: samples the number of nodes randomly and adds this bias
- `--n_nodes_min <int>`: samples the number of nodes randomly but clamps it at this value

Other optional flags are analogous to `generate_ligands.py`.

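One way to read the `--n_nodes_bias` and `--n_nodes_min` options is sketched below. The order of operations (bias first, then clamping from below) and the size distribution are assumptions made for illustration; the actual behaviour is defined in `test.py` and the model code.

```python
import random


def choose_n_nodes(size_distribution, n_nodes_bias=0, n_nodes_min=0, rng=None):
    """Sample a ligand size, shift it by a bias, then enforce a lower bound.

    size_distribution maps a number of nodes to its sampling weight; it is a
    hypothetical stand-in for whatever distribution the model samples from.
    """
    rng = rng or random.Random()
    sizes = list(size_distribution)
    weights = [size_distribution[s] for s in sizes]
    sampled = rng.choices(sizes, weights=weights, k=1)[0]
    # Assumed order: add the bias, then clamp the result at the minimum.
    return max(sampled + n_nodes_bias, n_nodes_min)


# Made-up distribution that always yields 10 nodes, to make the effect visible.
always_ten = {10: 1.0}
print(choose_n_nodes(always_ten))                                  # 10
print(choose_n_nodes(always_ten, n_nodes_bias=5))                  # 15
print(choose_n_nodes(always_ten, n_nodes_bias=5, n_nodes_min=20))  # 20
```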
### Fix substructures
`inpaint.py` can be used for partial ligand redesign with the conditionally trained model, e.g.:
```bash
python inpaint.py <checkpoint>.ckpt --pdbfile <pdb_file>.pdb --outfile <output_file> --ref_ligand <chain>:<resi> --fix_atoms C1 N6 C5 C12
```
`--add_n_nodes` controls the number of newly generated nodes. Other options are the same as before.

### Metrics
For assessing basic molecular properties create an instance of the `MoleculeProperties` class and run its `evaluate` method:
```python
from analysis.metrics import MoleculeProperties
mol_metrics = MoleculeProperties()
all_qed, all_sa, all_logp, all_lipinski, per_pocket_diversity = \
    mol_metrics.evaluate(pocket_mols)
```
`evaluate()` expects a list of lists where the inner list contains all RDKit molecules generated for one pocket.

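The nested input for `evaluate()` can be built from one SDF file per pocket. The sketch below assumes exactly that file layout; `load_pocket_mols` and `rdkit_loader` are illustrative names, not functions from this repository. RDKit is imported lazily, so the grouping logic can be tried with any other loader.

```python
from pathlib import Path


def rdkit_loader(path):
    """Read all parsable molecules from one SDF file with RDKit."""
    from rdkit import Chem  # imported lazily; requires RDKit to be installed
    return [mol for mol in Chem.SDMolSupplier(str(path)) if mol is not None]


def load_pocket_mols(sdf_paths, loader=rdkit_loader):
    """Return a list of lists: one inner list of molecules per pocket file."""
    return [loader(Path(path)) for path in sdf_paths]


# Hypothetical usage, assuming one output SDF file per pocket:
# pocket_mols = load_pocket_mols(["results/1abc_mols.sdf", "results/2xyz_mols.sdf"])
# mol_metrics.evaluate(pocket_mols)
```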
## Citation
```
@article{schneuing2024diffsbdd,
  title={Structure-based drug design with equivariant diffusion models},
  author={Schneuing, Arne and Harris, Charles and Du, Yuanqi and Didi, Kieran and Jamasb, Arian and Igashov, Ilia and Du, Weitao and Gomes, Carla and Blundell, Tom L and Lio, Pietro and Welling, Max and Bronstein, Michael and Correia, Bruno},
  journal={Nature Computational Science},
  year={2024},
  month={Dec},
  day={01},
  volume={4},
  number={12},
  pages={899-909},
  issn={2662-8457},
  doi={10.1038/s43588-024-00737-x},
  url={https://doi.org/10.1038/s43588-024-00737-x}
}
```