# DrugCLIP: Contrastive Protein-Molecule Representation Learning for Virtual Screening

[](https://github.com/xxxx/blob/main/LICENSE)
[](https://arxiv.org/pdf/2310.06367.pdf)

<!-- [[Code](xxxx - Overview)] -->

Official code for the paper "DrugCLIP: Contrastive Protein-Molecule Representation Learning for Virtual Screening", accepted at *Neural Information Processing Systems, 2023*. **The code is currently a raw version and will be updated as soon as possible.** If you have any questions, feel free to contact billgao0111@gmail.com.

# Requirements

Same as [Uni-Mol](https://github.com/dptech-corp/Uni-Mol/tree/main/unimol).

**The RDKit version should be 2022.9.5.**

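Because the RDKit version is pinned, a quick check before training can catch a mismatched environment. The following is a minimal sketch, not part of the repository: the helper names `parse_version` and `rdkit_matches` are hypothetical, and it assumes RDKit was installed as a distribution named `rdkit` that `importlib.metadata` can see (some conda installs may not register one).

```python
import re
from importlib import metadata


def parse_version(text: str) -> tuple:
    """Turn a dotted version such as '2022.09.5' into (2022, 9, 5)."""
    # Only the first three numeric fields are compared, so zero padding
    # ('09' vs '9') does not matter.
    return tuple(int(part) for part in re.findall(r"\d+", text)[:3])


def rdkit_matches(required: str = "2022.9.5") -> bool:
    """Return True if the installed rdkit distribution matches `required`."""
    try:
        # Assumption: the distribution is registered under the name 'rdkit'.
        installed = metadata.version("rdkit")
    except metadata.PackageNotFoundError:
        return False
    return parse_version(installed) == parse_version(required)


if __name__ == "__main__":
    print("rdkit version OK" if rdkit_matches() else "rdkit missing or wrong version")
```
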
## Data and checkpoints

https://drive.google.com/drive/folders/1zW1MGpgunynFxTKXC2Q4RgWxZmg6CInV?usp=sharing

The folder currently includes the training data, the trained checkpoint, and the test data for DUD-E.

### Training data

The training dataset is available in the Google Drive folder as `train_no_test_af.zip`. It contains the following files:

```
dict_pkt.txt: dictionary for pocket atom types
dict_mol.txt: dictionary for molecule atom types
train.lmdb: train dataset
valid.lmdb: validation dataset
```

Use `py_scripts/lmdb_utils.py` to read the LMDB files. The keys in each record and their descriptions are shown below:

```
"atoms": "atom types for each atom in the ligand"
"coordinates": "3D coordinates for each atom in the ligand generated by RDKit. Max number of conformations is 10"
"pocket_atoms": "atom types for each atom in the pocket"
"pocket_coordinates": "3D coordinates for each atom in the pocket"
"mol": "RDKit molecule object for the ligand"
"smi": "SMILES string for the ligand"
"pocket": "PDB ID of the pocket"
```

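The record layout listed above can be inspected with a few lines of Python. This is a rough sketch rather than the interface of `py_scripts/lmdb_utils.py`, which is not shown here. It assumes that each LMDB value is a pickled dictionary and that the database is stored as a single file (the convention used by Uni-Mol), and that `coordinates` holds one entry per conformer. The function names are hypothetical. Unpickling real records also needs the packages whose objects they contain (for example RDKit for `mol`), and pickle should only be used on trusted files.

```python
import pickle
from typing import Any, Dict, Iterator

# Keys listed in the README for each training record.
EXPECTED_KEYS = (
    "atoms", "coordinates", "pocket_atoms",
    "pocket_coordinates", "mol", "smi", "pocket",
)


def decode_record(raw: bytes) -> Dict[str, Any]:
    """Unpickle one LMDB value (assumed to be a pickled dict)."""
    record = pickle.loads(raw)
    missing = [key for key in EXPECTED_KEYS if key not in record]
    if missing:
        raise KeyError(f"record is missing keys: {missing}")
    return record


def summarize(record: Dict[str, Any]) -> Dict[str, Any]:
    """Collect a few sizes that are handy for sanity checks."""
    return {
        "pocket": record["pocket"],
        "smi": record["smi"],
        "num_ligand_atoms": len(record["atoms"]),
        # Assumption: one coordinate array per conformer (at most 10).
        "num_conformers": len(record["coordinates"]),
        "num_pocket_atoms": len(record["pocket_atoms"]),
    }


def iter_records(path: str) -> Iterator[Dict[str, Any]]:
    """Yield decoded records from an LMDB database stored as one file."""
    import lmdb  # third-party; imported lazily so the helpers above work without it

    env = lmdb.open(path, subdir=False, readonly=True, lock=False,
                    readahead=False, meminit=False)
    try:
        with env.begin() as txn:
            for _key, raw in txn.cursor():
                yield decode_record(raw)
    finally:
        env.close()
```
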
The dataset is compiled from the PDBbind dataset. It combines real protein-ligand complexes with complexes generated through HomoAug, a homology-based data augmentation technique.

### Test data

#### DUD-E

```
DUD-E
├── gene id
│   ├── receptor.pdb
│   ├── crystal_ligand.mol2
│   ├── actives_final.ism
│   ├── decoys_final.ism
│   ├── mols.lmdb (containing all actives and decoys)
│   └── pocket.lmdb
```

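Before running the test script, it can help to confirm that every target directory matches the tree above. The sketch below is an illustration only: the function name `missing_dude_files` is hypothetical, and it assumes each sub-directory of the DUD-E root corresponds to one target.

```python
from pathlib import Path
from typing import Dict, List

# File names taken from the directory tree above.
DUDE_FILES = (
    "receptor.pdb", "crystal_ligand.mol2", "actives_final.ism",
    "decoys_final.ism", "mols.lmdb", "pocket.lmdb",
)


def missing_dude_files(root: str) -> Dict[str, List[str]]:
    """Map each target directory under `root` to the expected files it lacks."""
    report: Dict[str, List[str]] = {}
    for target_dir in sorted(Path(root).iterdir()):
        if not target_dir.is_dir():
            continue  # ignore stray files next to the target folders
        missing = [name for name in DUDE_FILES
                   if not (target_dir / name).is_file()]
        if missing:
            report[target_dir.name] = missing
    return report
```

An empty dictionary means every target directory contains all six files.
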
#### PCBA

```
lit_pcba
├── target name
│   ├── PDBID_protein.mol2
│   ├── PDBID_ligand.mol2
│   ├── actives.smi
│   ├── inactives.smi
│   ├── mols.lmdb (containing all actives and inactives)
│   └── pocket.lmdb
```

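The `actives.smi` and `inactives.smi` files give the ground-truth labels for each target. A minimal sketch for loading them is shown here. It assumes each non-empty line starts with a SMILES string, optionally followed by whitespace and an identifier. The function names are hypothetical and not part of the repository.

```python
from pathlib import Path
from typing import List, Tuple


def read_smiles(path: Path) -> List[str]:
    """Return the first whitespace-separated field of every non-empty line."""
    smiles = []
    for line in path.read_text().splitlines():
        fields = line.split()
        if fields:
            smiles.append(fields[0])
    return smiles


def load_labeled_smiles(target_dir: str) -> List[Tuple[str, int]]:
    """Label molecules from actives.smi as 1 and from inactives.smi as 0."""
    folder = Path(target_dir)
    labeled = [(smi, 1) for smi in read_smiles(folder / "actives.smi")]
    labeled += [(smi, 0) for smi in read_smiles(folder / "inactives.smi")]
    return labeled
```
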
### Data preprocessing

See `py_scripts/write_dude_multi.py`.

## HomoAug

Please refer to the `HomoAug` directory for details.

## Train

`bash drugclip.sh`

## Test

`bash test.sh`

## Retrieval

`bash retrieval.sh`

The Google Drive folder contains example `pocket.lmdb` and `mols.lmdb` files under the `retrieval` directory.

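The internals of `retrieval.sh` are not described here, so the following only illustrates the general idea behind retrieval with contrastively trained encoders: embed the pocket and the candidate molecules, then rank molecules by similarity to the pocket. Cosine similarity, the plain-list embeddings, and the names `cosine` and `rank_molecules` are all assumptions made for this sketch, not the repository's implementation.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def rank_molecules(pocket_emb: Sequence[float],
                   mol_embs: Dict[str, Sequence[float]],
                   top_k: int = 5) -> List[Tuple[str, float]]:
    """Return the `top_k` molecules most similar to the pocket embedding."""
    scored = [(name, cosine(pocket_emb, emb)) for name, emb in mol_embs.items()]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top_k]
```

In this picture, the molecule embeddings can be computed once and reused for every new pocket, which is what makes embedding-based screening cheap at query time.
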
## Citation

If you find our work useful, please cite our paper:

```bibtex
@inproceedings{gao2023drugclip,
  author = {Gao, Bowen and Qiang, Bo and Tan, Haichuan and Jia, Yinjun and Ren, Minsi and Lu, Minsi and Liu, Jingjing and Ma, Wei-Ying and Lan, Yanyan},
  title = {DrugCLIP: Contrastive Protein-Molecule Representation Learning for Virtual Screening},
  booktitle = {NeurIPS 2023},
  year = {2023},
  url = {https://openreview.net/forum?id=lAbCgNcxm7},
}
```