# DrugCLIP: Contrastive Protein-Molecule Representation Learning for Virtual Screening

[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://github.com/xxxx/blob/main/LICENSE)
[![ArXiv](http://img.shields.io/badge/cs.LG-arXiv%3A2310.06367-B31B1B.svg)](https://arxiv.org/pdf/2310.06367.pdf)

<!-- [[Code](xxxx - Overview)] -->

![cover](framework.png)

Official code for the paper "DrugCLIP: Contrastive Protein-Molecule Representation Learning for Virtual Screening", accepted at *Neural Information Processing Systems, 2023*. **The code is currently a raw version and will be updated as soon as possible.** If you have any inquiries, feel free to contact billgao0111@gmail.com.

# Requirements

The requirements are the same as for [Uni-Mol](https://github.com/dptech-corp/Uni-Mol/tree/main/unimol).

**The RDKit version should be 2022.9.5.**

## Data and checkpoints

https://drive.google.com/drive/folders/1zW1MGpgunynFxTKXC2Q4RgWxZmg6CInV?usp=sharing

The folder currently includes the training data, the trained checkpoint, and the test data for DUD-E.

### Training data

The training dataset is included in the Google Drive folder as train_no_test_af.zip. It contains the following files:

```
dict_pkt.txt: dictionary for pocket atom types
dict_mol.txt: dictionary for molecule atom types
train.lmdb: train dataset
valid.lmdb: validation dataset
```
Use `py_scripts/lmdb_utils.py` to read the LMDB files. The keys in the LMDB files and their descriptions are shown below:

```
"atoms": "atom types for each atom in the ligand"
"coordinates": "3D coordinates for each atom in the ligand generated by RDKit. Max number of conformations is 10"
"pocket_atoms": "atom types for each atom in the pocket"
"pocket_coordinates": "3D coordinates for each atom in the pocket"
"mol": "RDKit molecule object for the ligand"
"smi": "SMILES string for the ligand"
"pocket": "PDB ID of the pocket"
```
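The sketch below shows one way such records could be inspected. It is not a copy of `py_scripts/lmdb_utils.py`. It assumes the files follow the Uni-Mol convention of a single-file LMDB whose values are pickled Python dicts, and that `"coordinates"` holds one entry per conformation. The function names `decode_record`, `summarize_record`, and `iter_lmdb_records` are hypothetical. Unpickling records that contain the `"mol"` field requires RDKit to be installed, and pickled data should only be loaded from a trusted source.

```python
import pickle


def decode_record(raw: bytes) -> dict:
    """Unpickle one LMDB value into a dict (assumes Uni-Mol-style pickled values)."""
    return pickle.loads(raw)


def summarize_record(record: dict) -> dict:
    """Collect a few quick facts about one record, using the keys listed above."""
    return {
        "pocket": record["pocket"],
        "smi": record["smi"],
        "n_ligand_atoms": len(record["atoms"]),
        # Assumption: "coordinates" is a sequence with one entry per conformation.
        "n_conformations": len(record["coordinates"]),
        "n_pocket_atoms": len(record["pocket_atoms"]),
    }


def iter_lmdb_records(path: str):
    """Yield (key, record) pairs from a single-file LMDB database."""
    import lmdb  # third-party; imported here so the helpers above need only the stdlib

    # subdir=False: the database is one file, not a directory (an assumption).
    env = lmdb.open(path, subdir=False, readonly=True, lock=False)
    try:
        with env.begin() as txn:
            for key, raw in txn.cursor():
                yield key, decode_record(raw)
    finally:
        env.close()
```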
The dataset is compiled from the PDBbind dataset. It combines authentic protein-ligand complexes with complexes generated through HomoAug, a technique for augmenting data with homology-based transformations.

### Test data

#### DUD-E

```
DUD-E
├── gene id
│   ├── receptor.pdb
│   ├── crystal_ligand.mol2
│   ├── actives_final.ism
│   ├── decoys_final.ism
│   ├── mols.lmdb (containing all actives and decoys)
│   ├── pocket.lmdb
```
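Before running the test script, it can help to confirm that every target directory matches the DUD-E layout above. The following is a minimal sketch using only the standard library. The helper name `find_missing_files` is hypothetical and not part of this repository; the file names are taken from the tree above.

```python
from pathlib import Path

# File names taken from the DUD-E layout shown above.
EXPECTED_FILES = (
    "receptor.pdb",
    "crystal_ligand.mol2",
    "actives_final.ism",
    "decoys_final.ism",
    "mols.lmdb",
    "pocket.lmdb",
)


def find_missing_files(dude_root):
    """Map each target directory name to the expected files it lacks."""
    missing = {}
    for target_dir in sorted(Path(dude_root).iterdir()):
        if not target_dir.is_dir():
            continue  # skip stray files at the top level
        absent = [name for name in EXPECTED_FILES if not (target_dir / name).exists()]
        if absent:
            missing[target_dir.name] = absent
    return missing
```

Calling `find_missing_files("DUD-E")` returns an empty dict when every target directory is complete.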
#### LIT-PCBA

```
lit_pcba
├── target name
│   ├── PDBID_protein.mol2
│   ├── PDBID_ligand.mol2
│   ├── actives.smi
│   ├── inactives.smi
│   ├── mols.lmdb (containing all actives and inactives)
│   ├── pocket.lmdb
```
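For evaluation, the actives and inactives of a target in the `lit_pcba` layout above need binary labels. The sketch below builds such a labeled list from the two `.smi` files. It assumes each non-empty line starts with a SMILES string followed by optional whitespace-separated fields such as an identifier. The function names `read_smiles` and `load_labeled_molecules` are hypothetical and not part of this repository.

```python
from pathlib import Path


def read_smiles(path):
    """Return the first whitespace-separated token of each non-empty line."""
    smiles = []
    for line in Path(path).read_text().splitlines():
        fields = line.split()
        if fields:  # ignore blank lines
            smiles.append(fields[0])
    return smiles


def load_labeled_molecules(target_dir):
    """Return (smiles, label) pairs: label 1 for actives, 0 for inactives."""
    target_dir = Path(target_dir)
    actives = [(smi, 1) for smi in read_smiles(target_dir / "actives.smi")]
    inactives = [(smi, 0) for smi in read_smiles(target_dir / "inactives.smi")]
    return actives + inactives
```

The order of the returned list is all actives followed by all inactives, which may or may not match the order of entries in `mols.lmdb`.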
### Data preprocessing

See `py_scripts/write_dude_multi.py`.

## HomoAug

Please refer to the `HomoAug` directory for details.

## Train

`bash drugclip.sh`

## Test

`bash test.sh`

## Retrieval

`bash retrieval.sh`

In the Google Drive folder, you can find example files for `pocket.lmdb` and `mols.lmdb` under the `retrieval` directory.
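
This README does not describe what `retrieval.sh` does internally. As a rough illustration only, retrieval with contrastively trained encoders is commonly done by ranking molecule embeddings by their similarity to a pocket embedding. The sketch below uses toy vectors and cosine similarity; the function names, the embedding values, and the choice of cosine similarity are assumptions, not the repository's implementation.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_molecules(pocket_embedding, molecule_embeddings, top_k=None):
    """Return (name, score) pairs sorted from most to least similar."""
    scored = [
        (name, cosine_similarity(pocket_embedding, emb))
        for name, emb in molecule_embeddings.items()
    ]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top_k]  # top_k=None keeps the full ranking


# Toy embeddings standing in for encoder outputs (hypothetical values).
pocket = [1.0, 0.0]
molecules = {"mol_a": [1.0, 0.0], "mol_b": [0.0, 1.0], "mol_c": [1.0, 1.0]}
ranking = rank_molecules(pocket, molecules)
```

Here `mol_a` points in the same direction as the pocket vector and ranks first, while the orthogonal `mol_b` ranks last.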
## Citation

If you find our work useful, please cite our paper:

```bibtex
@inproceedings{gao2023drugclip,
    author = {Gao, Bowen and Qiang, Bo and Tan, Haichuan and Jia, Yinjun and Ren, Minsi and Lu, Minsi and Liu, Jingjing and Ma, Wei-Ying and Lan, Yanyan},
    title = {DrugCLIP: Contrastive Protein-Molecule Representation Learning for Virtual Screening},
    booktitle = {NeurIPS 2023},
    year = {2023},
    url = {https://openreview.net/forum?id=lAbCgNcxm7},
}
```