a/README.md b/README.md
1
# DrugCLIP: Contrastive Protein-Molecule Representation Learning for Virtual Screening
1
# DrugCLIP: Contrastive Protein-Molecule Representation Learning for Virtual Screening
2
2
3
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://github.com/xxxx/blob/main/LICENSE)
3
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://github.com/xxxx/blob/main/LICENSE)
4
[![ArXiv](http://img.shields.io/badge/cs.LG-arXiv%3A2310.06367-B31B1B.svg)](https://arxiv.org/pdf/2310.06367.pdf)
4
[![ArXiv](http://img.shields.io/badge/cs.LG-arXiv%3A2310.06367-B31B1B.svg)](https://arxiv.org/pdf/2310.06367.pdf)
5
5
6
<!-- [[Code](xxxx - Overview)] -->
6
<!-- [[Code](xxxx - Overview)] -->
7
7
8
![cover](framework.png)
8
![cover](https://github.com/bowen-gao/DrugCLIP/blob/main/framework.png?raw=true)
9
9
10
Official code for the paper "DrugCLIP: Contrastive Protein-Molecule Representation Learning for Virtual Screening", accepted at *Neural Information Processing Systems, 2023*. **The code is currently a raw version and will be updated as soon as possible.** For inquiries, contact billgao0111@gmail.com.
10
Official code for the paper "DrugCLIP: Contrastive Protein-Molecule Representation Learning for Virtual Screening", accepted at *Neural Information Processing Systems, 2023*. **The code is currently a raw version and will be updated as soon as possible.** For inquiries, contact billgao0111@gmail.com.
11
11
12
# Requirements
12
# Requirements
13
13
14
Same as [Uni-Mol](https://github.com/dptech-corp/Uni-Mol/tree/main/unimol).
14
Same as [Uni-Mol](https://github.com/dptech-corp/Uni-Mol/tree/main/unimol).
15
15
16
**rdkit version should be 2022.9.5**
16
**rdkit version should be 2022.9.5**
17
17
18
## Data and checkpoints
18
## Data and checkpoints
19
19
20
https://drive.google.com/drive/folders/1zW1MGpgunynFxTKXC2Q4RgWxZmg6CInV?usp=sharing
20
https://drive.google.com/drive/folders/1zW1MGpgunynFxTKXC2Q4RgWxZmg6CInV?usp=sharing
21
21
22
It currently includes the training data, the trained checkpoint, and the test data for DUD-E.
22
It currently includes the training data, the trained checkpoint, and the test data for DUD-E.
23
23
24
24
25
25
26
### Training data
26
### Training data
27
27
28
The training dataset is included in the Google Drive folder as train_no_test_af.zip. It contains several files:
28
The training dataset is included in the Google Drive folder as train_no_test_af.zip. It contains several files:
29
29
30
```
30
```
31
31
32
dict_pkt.txt: dictionary for pocket atom types
32
dict_pkt.txt: dictionary for pocket atom types
33
33
34
dict_mol.txt: dictionary for molecule atom types
34
dict_mol.txt: dictionary for molecule atom types
35
35
36
train.lmdb: train dataset
36
train.lmdb: train dataset
37
37
38
valid.lmdb: validation dataset
38
valid.lmdb: validation dataset
39
39
40
```
40
```
41
41
42
Use py_scripts/lmdb_utils.py to read the lmdb file. The keys in the lmdb files and corresponding descriptions are shown below:
42
Use py_scripts/lmdb_utils.py to read the lmdb file. The keys in the lmdb files and corresponding descriptions are shown below:
43
43
44
```
44
```
45
45
46
"atoms": "atom types for each atom in the ligand" 
46
"atoms": "atom types for each atom in the ligand" 
47
47
48
"coordinates": "3D coordinates for each atom in the ligand generated by RDKit. Max number of conformations is 10"
48
"coordinates": "3D coordinates for each atom in the ligand generated by RDKit. Max number of conformations is 10"
49
49
50
"pocket_atoms": "atom types for each atom in the pocket"
50
"pocket_atoms": "atom types for each atom in the pocket"
51
51
52
"pocket_coordinates": "3D coordinates for each atom in the pocket"
52
"pocket_coordinates": "3D coordinates for each atom in the pocket"
53
53
54
"mol": "RDKit molecule object for the ligand"
54
"mol": "RDKit molecule object for the ligand"
55
55
56
"smi": "SMILES string for the ligand"
56
"smi": "SMILES string for the ligand"
57
57
58
"pocket": "PDB ID of the pocket"
58
"pocket": "PDB ID of the pocket"
59
```
59
```
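The record layout above can be read back with a few lines of Python. This is a minimal sketch, not the contents of `py_scripts/lmdb_utils.py`: it assumes each LMDB value is a pickled Python dict (the convention used by Uni-Mol-style datasets) and that each `.lmdb` is a single file rather than a directory. The names `decode_record`, `iter_records`, and `EXPECTED_KEYS` are hypothetical helpers introduced here for illustration.

```python
import pickle
from typing import Any, Dict, Iterator

# Keys listed in the README for each training record.
EXPECTED_KEYS = {
    "atoms", "coordinates", "pocket_atoms", "pocket_coordinates",
    "mol", "smi", "pocket",
}


def decode_record(raw: bytes) -> Dict[str, Any]:
    """Unpickle one LMDB value (assumed encoding) and check its keys."""
    # Only unpickle data from a source you trust; pickle can run arbitrary code.
    # Unpickling the "mol" field additionally requires RDKit to be installed.
    record = pickle.loads(raw)
    missing = EXPECTED_KEYS - set(record)
    if missing:
        raise KeyError(f"record is missing keys: {sorted(missing)}")
    return record


def iter_records(lmdb_path: str) -> Iterator[Dict[str, Any]]:
    """Yield decoded records from an LMDB file opened read-only."""
    import lmdb  # third-party package, imported lazily

    # subdir=False assumes the .lmdb path is a single file, not a directory.
    env = lmdb.open(lmdb_path, subdir=False, readonly=True, lock=False,
                    readahead=False, meminit=False)
    try:
        with env.begin() as txn:
            for _key, value in txn.cursor():
                yield decode_record(value)
    finally:
        env.close()
```

For example, `next(iter_records("train.lmdb"))["smi"]` would return the SMILES string of the first ligand, provided the assumptions above hold for the downloaded files.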
60
60
61
61
62
The dataset is compiled from the PDBbind dataset and combines authentic protein-ligand complexes with complexes generated through HomoAug, a technique for augmenting data with homology-based transformations.
62
The dataset is compiled from the PDBbind dataset and combines authentic protein-ligand complexes with complexes generated through HomoAug, a technique for augmenting data with homology-based transformations.
63
63
64
64
65
### Test data
65
### Test data
66
66
67
#### DUD-E
67
#### DUD-E
68
68
69
```
69
```
70
DUD-E
70
DUD-E
71
├── gene id
71
├── gene id
72
│   ├── receptor.pdb
72
│   ├── receptor.pdb
73
│   ├── crystal_ligand.mol2
73
│   ├── crystal_ligand.mol2
74
│   ├── actives_final.ism
74
│   ├── actives_final.ism
75
│   ├── decoys_final.ism
75
│   ├── decoys_final.ism
76
│   ├── mols.lmdb (containing all actives and decoys)
76
│   ├── mols.lmdb (containing all actives and decoys)
77
│   ├── pocket.lmdb
77
│   ├── pocket.lmdb
78
78
79
```
79
```
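Before running evaluation, it can help to confirm that a downloaded target directory matches the DUD-E layout shown above. The following is a minimal sketch under that assumption; `missing_dude_files` and `DUDE_TARGET_FILES` are hypothetical names introduced here and are not part of the repository.

```python
from pathlib import Path
from typing import List

# File names taken from the DUD-E tree in the README.
DUDE_TARGET_FILES = (
    "receptor.pdb",
    "crystal_ligand.mol2",
    "actives_final.ism",
    "decoys_final.ism",
    "mols.lmdb",
    "pocket.lmdb",
)


def missing_dude_files(target_dir: str) -> List[str]:
    """Return the expected entries that are absent from one target directory."""
    root = Path(target_dir)
    # exists() is used so that an .lmdb stored as a file or as a directory both pass.
    return [name for name in DUDE_TARGET_FILES if not (root / name).exists()]
```

An empty list means every expected entry is present; otherwise the returned names show what still needs to be downloaded or generated.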
80
80
81
#### PCBA
81
#### PCBA
82
82
83
```
83
```
84
lit_pcba
84
lit_pcba
85
├── target name
85
├── target name
86
│   ├── PDBID_protein.mol2
86
│   ├── PDBID_protein.mol2
87
│   ├── PDBID_ligand.mol2
87
│   ├── PDBID_ligand.mol2
88
│   ├── actives.smi
88
│   ├── actives.smi
89
│   ├── inactives.smi
89
│   ├── inactives.smi
90
│   ├── mols.lmdb (containing all actives and inactives)
90
│   ├── mols.lmdb (containing all actives and inactives)
91
│   ├── pocket.lmdb
91
│   ├── pocket.lmdb
92
92
93
```
93
```
94
94
95
95
96
### Data preprocessing
96
### Data preprocessing
97
97
98
see py_scripts/write_dude_multi.py
98
see py_scripts/write_dude_multi.py
99
99
100
## HomoAug
100
## HomoAug
101
101
102
Please refer to the HomoAug directory for details.
102
Please refer to the HomoAug directory for details.
103
103
104
## Train
104
## Train
105
105
106
bash drugclip.sh
106
bash drugclip.sh
107
107
108
## Test
108
## Test
109
109
110
bash test.sh
110
bash test.sh
111
111
112
112
113
## Retrieval 
113
## Retrieval 
114
114
115
bash retrieval.sh
115
bash retrieval.sh
116
116
117
In the Google Drive folder, you can find example files for pocket.lmdb and mols.lmdb under the retrieval directory.
117
In the Google Drive folder, you can find example files for pocket.lmdb and mols.lmdb under the retrieval directory.
118
118
119
119
120
## Citation
120
## Citation
121
121
122
If you find our work useful, please cite our paper:
122
If you find our work useful, please cite our paper:
123
123
124
```bibtex
124
```bibtex
125
@inproceedings{gao2023drugclip,
125
@inproceedings{gao2023drugclip,
126
    author = {Gao, Bowen and Qiang, Bo and Tan, Haichuan and Jia, Yinjun and Ren, Minsi and Lu, Minsi and Liu, Jingjing and Ma, Wei-Ying and Lan, Yanyan},
126
    author = {Gao, Bowen and Qiang, Bo and Tan, Haichuan and Jia, Yinjun and Ren, Minsi and Lu, Minsi and Liu, Jingjing and Ma, Wei-Ying and Lan, Yanyan},
127
    title = {DrugCLIP: Contrastive Protein-Molecule Representation Learning for Virtual Screening},
127
    title = {DrugCLIP: Contrastive Protein-Molecule Representation Learning for Virtual Screening},
128
    booktitle = {NeurIPS 2023},
128
    booktitle = {NeurIPS 2023},
129
    year = {2023},
129
    year = {2023},
130
    url = {https://openreview.net/forum?id=lAbCgNcxm7},
130
    url = {https://openreview.net/forum?id=lAbCgNcxm7},
131
}
131
}
132
```
132
```