# DiffSBDD: Structure-based Drug Design with Equivariant Diffusion Models

Official implementation of **DiffSBDD**, an equivariant diffusion model for structure-based drug design, by Arne Schneuing, Charles Harris, Yuanqi Du, Kieran Didi, Arian Jamasb, Ilia Igashov, Weitao Du, Carla Gomes, Tom Blundell, Pietro Lio, Max Welling, Michael Bronstein & Bruno Correia.

[![DOI](https://zenodo.org/badge/DOI/10.1038/s43588-024-00737-x.svg)](https://doi.org/10.1038/s43588-024-00737-x)
[![arXiv](https://img.shields.io/badge/arXiv-2210.13695-B31B1B.svg)](http://arxiv.org/abs/2210.13695)
[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/arneschneuing/DiffSBDD/blob/main/colab/DiffSBDD.ipynb)

> [!TIP]
> You can also try out our new 3D generative models for drug design at https://github.com/LPDI-EPFL/DrugFlow.

![](img/overview.png)

1. [Dependencies](#dependencies)
   1. [Conda environment](#conda-environment)
   2. [Pre-trained models](#pre-trained-models)
2. [Step-by-step examples](#step-by-step-examples)
   1. [De novo design](#de-novo-design)
   2. [Substructure inpainting](#substructure-inpainting)
   3. [Molecular optimization](#molecular-optimization)
3. [Benchmarks](#benchmarks)
   1. [CrossDocked Benchmark](#crossdocked)
   2. [Binding MOAD](#binding-moad)
   3. [Sampled molecules](#sampled-molecules)
4. [Training](#training)
5. [Inference](#inference)
   1. [Sample molecules for a given pocket](#sample-molecules-for-a-given-pocket)
   2. [Test set sampling](#sample-molecules-for-all-pockets-in-the-test-set)
   3. [Fix substructures](#fix-substructures)
   4. [Metrics](#metrics)
6. [Citation](#citation)

## Dependencies

### Conda environment
```bash
conda create -n sbdd-env
conda activate sbdd-env
conda install pytorch cudatoolkit=10.2 -c pytorch
conda install -c conda-forge pytorch-lightning
conda install -c conda-forge wandb
conda install -c conda-forge rdkit
conda install -c conda-forge biopython
conda install -c conda-forge imageio
conda install -c anaconda scipy
conda install -c pyg pytorch-scatter
conda install -c conda-forge openbabel
conda install seaborn
```

The code was tested with the following versions:
| Software          | Version   |
|-------------------|-----------|
| Python            | 3.10.4    |
| CUDA              | 10.2.89   |
| PyTorch           | 1.12.1    |
| PyTorch Lightning | 1.7.4     |
| WandB             | 0.13.1    |
| RDKit             | 2022.03.2 |
| BioPython         | 1.79      |
| imageio           | 2.21.2    |
| SciPy             | 1.7.3     |
| PyTorch Scatter   | 2.0.9     |
| OpenBabel         | 3.1.1     |

### Pre-trained models
Pre-trained models can be downloaded from [Zenodo](https://zenodo.org/record/8183747).
- [CrossDocked, conditional $`C_\alpha`$ model](https://zenodo.org/record/8183747/files/crossdocked_ca_cond.ckpt?download=1)
- [CrossDocked, joint $`C_\alpha`$ model](https://zenodo.org/record/8183747/files/crossdocked_ca_joint.ckpt?download=1)
- [CrossDocked, conditional full-atom model](https://zenodo.org/record/8183747/files/crossdocked_fullatom_cond.ckpt?download=1)
- [CrossDocked, joint full-atom model](https://zenodo.org/record/8183747/files/crossdocked_fullatom_joint.ckpt?download=1)
- [Binding MOAD, conditional $`C_\alpha`$ model](https://zenodo.org/record/8183747/files/moad_ca_cond.ckpt?download=1)
- [Binding MOAD, joint $`C_\alpha`$ model](https://zenodo.org/record/8183747/files/moad_ca_joint.ckpt?download=1)
- [Binding MOAD, conditional full-atom model](https://zenodo.org/record/8183747/files/moad_fullatom_cond.ckpt?download=1)
- [Binding MOAD, joint full-atom model](https://zenodo.org/record/8183747/files/moad_fullatom_joint.ckpt?download=1)
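
The eight checkpoint links above follow a regular naming pattern (`<dataset>_<representation>_<mode>.ckpt`). As a convenience, the following minimal sketch rebuilds the download URLs from that pattern. The function name `checkpoint_url` and the constants are illustrative helpers and not part of the repository.

```python
from itertools import product

# Base URL taken from the links listed above.
ZENODO_BASE = "https://zenodo.org/record/8183747/files"

DATASETS = ("crossdocked", "moad")
REPRESENTATIONS = ("ca", "fullatom")
MODES = ("cond", "joint")


def checkpoint_url(dataset: str, representation: str, mode: str) -> str:
    """Build the download URL for one pre-trained checkpoint (hypothetical helper)."""
    if dataset not in DATASETS:
        raise ValueError(f"unknown dataset: {dataset}")
    if representation not in REPRESENTATIONS:
        raise ValueError(f"unknown representation: {representation}")
    if mode not in MODES:
        raise ValueError(f"unknown mode: {mode}")
    return f"{ZENODO_BASE}/{dataset}_{representation}_{mode}.ckpt?download=1"


# All eight combinations listed above.
all_urls = [checkpoint_url(d, r, m) for d, r, m in product(DATASETS, REPRESENTATIONS, MODES)]

for url in all_urls:
    print(url)
```

The printed URLs can be passed to `wget` or any other download tool.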

## Step-by-step examples

These simple step-by-step examples provide an easy entry point to generating molecules with DiffSBDD.
More details about training and sampling scripts are provided below.

Before we run the sampling scripts we need to download a model checkpoint:
```bash
wget -P checkpoints/ https://zenodo.org/record/8183747/files/crossdocked_fullatom_cond.ckpt
```
It will be stored in the `./checkpoints` folder.

### De novo design

Using the trained model weights, we can sample new ligands with a single command. In this example, we use the protein with PDB ID `3RFM`, which can be found in the `example` folder.
The PDB file contains a reference ligand in chain A at residue number 330 that we can use to specify the designated binding pocket.
The following command will generate 20 samples and save them in a file called `3rfm_mol.sdf` in the `./example` folder.
```bash
python generate_ligands.py checkpoints/crossdocked_fullatom_cond.ckpt --pdbfile example/3rfm.pdb --outfile example/3rfm_mol.sdf --ref_ligand A:330 --n_samples 20
```
Instead of specifying the chain and residue number we can also provide an SDF file with the reference ligand:
```bash
python generate_ligands.py checkpoints/crossdocked_fullatom_cond.ckpt --pdbfile example/3rfm.pdb --outfile example/3rfm_mol.sdf --ref_ligand example/3rfm_B_CFF.sdf --n_samples 20
```
If no reference ligand is known, the binding pocket can also be specified as a list of residues as described [below](#sample-molecules-for-a-given-pocket).

### Substructure inpainting

To design molecules around fixed substructures (scaffold elaboration, fragment linking, etc.) you can run the `inpaint.py` script.
Here, we demonstrate its usage with a fragment linking example. Similar to `generate_ligands.py`, the inpainting script allows us to define pockets based on a reference ligand in SDF format
or with a chain and residue identifier (if it is in the PDB).
The easiest way to fix substructures is to provide them in a separate SDF file using the `--fix_atoms` flag.
However, the script also accepts a list of atom names which must correspond to the atoms of the reference ligand in the PDB file, e.g. `--fix_atoms C1 N6 C5 C12`.
```bash
python inpaint.py checkpoints/crossdocked_fullatom_cond.ckpt --pdbfile example/5ndu.pdb --outfile example/5ndu_linked_mols.sdf --ref_ligand example/5ndu_C_8V2.sdf --fix_atoms example/fragments.sdf --center ligand --add_n_nodes 10
```
Note that the `--center ligand` option tells DiffSBDD to sample the additional atoms near the center of mass of the fixed substructure, which is not always ideal or desired.
For instance, the inputs could be two fragments with very different sizes, in which case the random noise will be sampled very close to the larger fragment.
We currently also support sampling in the pocket center (`--center pocket`) but in some cases neither of these two options might be suitable and a problem-specific solution is warranted to avoid bad results.
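
To make the caveat about unequal fragment sizes concrete, here is a small toy calculation. It is only an illustration and does not reproduce DiffSBDD's internals: it assumes the center is the unweighted mean of the fixed atoms' coordinates, and the coordinates are made up.

```python
def centroid(coords):
    """Unweighted mean of a list of (x, y, z) coordinates."""
    n = len(coords)
    return tuple(sum(c[axis] for c in coords) / n for axis in range(3))


# Hypothetical toy input: a 9-atom fragment collapsed at the origin and a
# single-atom fragment 10 Angstrom away along x.
large_fragment = [(0.0, 0.0, 0.0)] * 9
small_fragment = [(10.0, 0.0, 0.0)]

# Center over all fixed atoms: pulled strongly towards the larger fragment.
center = centroid(large_fragment + small_fragment)

# Midpoint between the two fragment centers, where a linker would be expected.
midpoint = centroid([centroid(large_fragment), centroid(small_fragment)])

print(center)    # (1.0, 0.0, 0.0)
print(midpoint)  # (5.0, 0.0, 0.0)
```

In this toy case the sampling center sits 1 Å from the large fragment and 9 Å from the small one, even though a linker would have to span the gap between them.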

Another important parameter is `--add_n_nodes` which determines how many new atoms will be added. If it is not provided, a random number will be sampled.

### Molecular optimization

You can use DiffSBDD to optimize existing molecules for given properties via the `optimize.py` script.

```bash
python optimize.py --checkpoint checkpoints/crossdocked_fullatom_cond.ckpt --pdbfile example/5ndu.pdb --outfile output.sdf --ref_ligand example/5ndu_C_8V2.sdf --objective sa --population_size 100 --evolution_steps 10 --top_k 10 --timesteps 100
```

Important parameters in the evolutionary algorithm are:
- `--checkpoint`: The checkpoint to use for the noising-denoising model.
- `--objective`: The optimization objective. Currently supports 'qed' for Quantitative Estimate of Drug-likeness and 'sa' for Synthetic Accessibility. Custom objectives can be implemented within the code.
- `--population_size`: The size of the molecule population to maintain across the optimization generations.
- `--evolution_steps`: The number of evolutionary steps (generations) to perform during the optimization process.
- `--top_k`: The number of top-scoring molecules to select from one generation to the next.
- `--timesteps`: The number of noise-denoise steps to use in the optimization algorithm. Defaults to 100 (out of T=500).
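
How these parameters interact can be sketched with a generic evolutionary loop. This is not the code in `optimize.py`: the function `evolve`, the toy objective, and the `perturb` step (a stand-in for partial noising and denoising of a molecule) are all hypothetical. The sketch also assumes that a higher score is better and that the selected candidates are carried over unchanged.

```python
import random


def evolve(seeds, score, perturb, population_size, evolution_steps, top_k, rng):
    """Generic select-and-perturb loop (illustrative only)."""
    population = list(seeds)
    for _ in range(evolution_steps):
        # Keep the top_k highest-scoring candidates of this generation.
        survivors = sorted(population, key=score, reverse=True)[:top_k]
        # Refill the population by perturbing randomly chosen survivors.
        n_children = population_size - len(survivors)
        children = [perturb(rng.choice(survivors), rng) for _ in range(n_children)]
        population = survivors + children
    return sorted(population, key=score, reverse=True)


# Toy stand-ins: candidates are floats, the objective peaks at x = 3.
def toy_score(x):
    return -(x - 3.0) ** 2


def toy_perturb(x, rng):
    return x + rng.gauss(0.0, 0.5)


rng = random.Random(0)
final = evolve([0.0], toy_score, toy_perturb,
               population_size=100, evolution_steps=10, top_k=10, rng=rng)
print(len(final), toy_score(final[0]) >= toy_score(0.0))
```

Because the survivors are kept in this sketch, the best score can never decrease from one generation to the next.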

## Benchmarks
### CrossDocked

#### Data preparation
Download and extract the dataset as described by the authors of Pocket2Mol: https://github.com/pengxingang/Pocket2Mol/tree/main/data

Process the raw data using
```bash
python process_crossdock.py <crossdocked_dir> --no_H
```

### Binding MOAD
#### Data preparation
Download the dataset
```bash
wget http://www.bindingmoad.org/files/biou/every_part_a.zip
wget http://www.bindingmoad.org/files/biou/every_part_b.zip
wget http://www.bindingmoad.org/files/csv/every.csv

unzip every_part_a.zip
unzip every_part_b.zip
```
Process the raw data using
```bash
python -W ignore process_bindingmoad.py <bindingmoad_dir>
```
Add the `--ca_only` flag to create a dataset with $C_\alpha$ pocket representation.

### Sampled molecules
Sampled molecules can be found on [Zenodo](https://zenodo.org/record/8239058).

## Training
Starting a new training run:
```bash
python -u train.py --config <config>.yml
```

Resuming a previous run:
```bash
python -u train.py --config <config>.yml --resume <checkpoint>.ckpt
```

## Inference

### Sample molecules for a given pocket
To sample small molecules for a given pocket with a trained model use the following command:
```bash
python generate_ligands.py <checkpoint>.ckpt --pdbfile <pdb_file>.pdb --outfile <output_file> --resi_list <list_of_pocket_residue_ids>
```
For example:
```bash
python generate_ligands.py last.ckpt --pdbfile 1abc.pdb --outfile results/1abc_mols.sdf --resi_list A:1 A:2 A:3 A:4 A:5 A:6 A:7
```
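
For longer pockets, typing every `<chain>:<resi>` entry by hand is error-prone. The following sketch builds the `--resi_list` arguments for the example above programmatically. The helper name `resi_list` is hypothetical, and the checkpoint and file names are the placeholders from the example command.

```python
import shlex


def resi_list(chain, residue_ids):
    """Format residue identifiers as '<chain>:<resi>' strings (hypothetical helper)."""
    return [f"{chain}:{resi}" for resi in residue_ids]


pocket = resi_list("A", range(1, 8))
print(" ".join(pocket))  # A:1 A:2 A:3 A:4 A:5 A:6 A:7

# Assemble the full command as an argument list, e.g. for subprocess.run(cmd).
cmd = [
    "python", "generate_ligands.py", "last.ckpt",
    "--pdbfile", "1abc.pdb",
    "--outfile", "results/1abc_mols.sdf",
    "--resi_list", *pocket,
]
print(shlex.join(cmd))
```

Residues from several chains can be combined by concatenating the lists, e.g. `resi_list("A", [10, 11]) + resi_list("B", [42])`.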
Alternatively, the binding pocket can also be specified based on a reference ligand in the same PDB file:
```bash
python generate_ligands.py <checkpoint>.ckpt --pdbfile <pdb_file>.pdb --outfile <output_file> --ref_ligand <chain>:<resi>
```
or with a separate SDF file:
```bash
python generate_ligands.py <checkpoint>.ckpt --pdbfile <pdb_file>.pdb --outfile <output_file> --ref_ligand <ref_ligand>.sdf
```

Optional flags:
| Flag | Description |
|------|-------------|
| `--n_samples` | Number of sampled molecules |
| `--num_nodes_lig` | Size of sampled molecules |
| `--timesteps` | Number of denoising steps for inference |
| `--all_frags` | Keep all disconnected fragments |
| `--sanitize` | Sanitize molecules (invalid molecules will be removed if this flag is present) |
| `--relax` | Relax generated structure in force field (does not consider the protein and might introduce clashes) |
| `--resamplings` | Inpainting parameter (doesn't apply if conditional model is used) |
| `--jump_length` | Inpainting parameter (doesn't apply if conditional model is used) |

### Sample molecules for all pockets in the test set
`test.py` can be used to sample molecules for the entire testing set:
```bash
python test.py <checkpoint>.ckpt --test_dir <bindingmoad_dir>/processed_noH/test/ --outdir <output_dir> --sanitize
```
There are different ways to determine the size of sampled molecules.
- `--fix_n_nodes`: generates ligands with the same number of nodes as the reference molecule
- `--n_nodes_bias <int>`: samples the number of nodes randomly and adds this bias
- `--n_nodes_min <int>`: samples the number of nodes randomly but clamps it at this value
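
One way to read the last two options is sketched below. This is an interpretation and not the code in `test.py`: it assumes that `--n_nodes_min` acts as a lower bound (as the flag name suggests) and that, if both options are given, the bias is added before clamping. The function name `adjust_n_nodes` is hypothetical.

```python
def adjust_n_nodes(sampled_n, n_nodes_bias=0, n_nodes_min=0):
    """Apply a size bias, then enforce a lower bound (assumed semantics)."""
    return max(sampled_n + n_nodes_bias, n_nodes_min)


# A randomly sampled size of 12 atoms under different settings.
print(adjust_n_nodes(12, n_nodes_bias=5))   # 17
print(adjust_n_nodes(12, n_nodes_min=20))   # 20
print(adjust_n_nodes(25, n_nodes_min=20))   # 25
```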

Other optional flags are analogous to `generate_ligands.py`.

### Fix substructures
`inpaint.py` can be used for partial ligand redesign with the conditionally trained model, e.g.:
```bash
python inpaint.py <checkpoint>.ckpt --pdbfile <pdb_file>.pdb --outfile <output_file> --ref_ligand <chain>:<resi> --fix_atoms C1 N6 C5 C12
```
`--add_n_nodes` controls the number of newly generated nodes. Other options are the same as before.

### Metrics
For assessing basic molecular properties create an instance of the `MoleculeProperties` class and run its `evaluate` method:
```python
from analysis.metrics import MoleculeProperties
mol_metrics = MoleculeProperties()
# pocket_mols: list of lists of RDKit molecules, one inner list per pocket
all_qed, all_sa, all_logp, all_lipinski, per_pocket_diversity = \
    mol_metrics.evaluate(pocket_mols)
```
`evaluate()` expects a list of lists where the inner list contains all RDKit molecules generated for one pocket.
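
The nested list can be assembled in several ways. The sketch below assumes a hypothetical layout with one SDF file per pocket. The helper names `read_sdf` and `build_pocket_mols` and the file names are illustrative. Molecules are read with RDKit's `Chem.SDMolSupplier`, which yields `None` for entries it cannot parse, so those are skipped.

```python
def read_sdf(path):
    """Read all parsable molecules from one SDF file."""
    from rdkit import Chem  # imported lazily so the helper below stays testable
    return [mol for mol in Chem.SDMolSupplier(str(path)) if mol is not None]


def build_pocket_mols(sdf_files, reader=read_sdf):
    """Return one inner list of molecules per SDF file (i.e. per pocket)."""
    return [reader(path) for path in sdf_files]


# Hypothetical usage, assuming these files exist:
# pocket_mols = build_pocket_mols(["results/pocket_a.sdf", "results/pocket_b.sdf"])
# mol_metrics.evaluate(pocket_mols)
```

The `reader` argument only exists so that the file format can be swapped out or mocked.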

## Citation
```
@article{schneuing2024diffsbdd,
   title={Structure-based drug design with equivariant diffusion models},
   author={Schneuing, Arne and Harris, Charles and Du, Yuanqi and Didi, Kieran and Jamasb, Arian and Igashov, Ilia and Du, Weitao and Gomes, Carla and Blundell, Tom L and Lio, Pietro and Welling, Max and Bronstein, Michael and Correia, Bruno},
   journal={Nature Computational Science},
   year={2024},
   month={Dec},
   day={01},
   volume={4},
   number={12},
   pages={899-909},
   issn={2662-8457},
   doi={10.1038/s43588-024-00737-x},
   url={https://doi.org/10.1038/s43588-024-00737-x}
}
```