Diff of /README.md [000000] .. [b40915]

 b/README.md
+# DrugCLIP: Contrastive Protein-Molecule Representation Learning for Virtual Screening
+[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://github.com/xxxx/blob/main/LICENSE)
+[![ArXiv](http://img.shields.io/badge/cs.LG-arXiv%3A2310.06367-B31B1B.svg)](https://arxiv.org/pdf/2310.06367.pdf)
+<!-- [[Code](xxxx - Overview)] -->
+![cover](framework.png)
+Official code for the paper "DrugCLIP: Contrastive Protein-Molecule Representation Learning for Virtual Screening", accepted at *Neural Information Processing Systems, 2023*. **The code is currently a raw version and will be updated ASAP**. If you have any inquiries, feel free to contact billgao0111@gmail.com.
+# Requirements
+Same as [Uni-Mol](https://github.com/dptech-corp/Uni-Mol/tree/main/unimol).
+**The RDKit version should be 2022.9.5.**
+## Data and checkpoints
+https://drive.google.com/drive/folders/1zW1MGpgunynFxTKXC2Q4RgWxZmg6CInV?usp=sharing
+It currently includes the training data, the trained checkpoint, and the test data for DUD-E.
+### Training data
+The training dataset is in the Google Drive folder as train_no_test_af.zip. It contains the following files:
+```
+dict_pkt.txt: dictionary for pocket atom types
+dict_mol.txt: dictionary for molecule atom types
+train.lmdb: train dataset
+valid.lmdb: validation dataset
+```
+Use py_scripts/lmdb_utils.py to read the LMDB files. The keys in the LMDB files and their descriptions are shown below:
+```
+"atoms": "atom types for each atom in the ligand"
+"coordinates": "3D coordinates for each atom in the ligand, generated by RDKit (at most 10 conformations)"
+"pocket_atoms": "atom types for each atom in the pocket"
+"pocket_coordinates": "3D coordinates for each atom in the pocket"
+"mol": "RDKit molecule object for the ligand"
+"smi": "SMILES string for the ligand"
+"pocket": "PDB ID of the pocket"
+```
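To make the record layout above concrete, here is a minimal sketch of decoding and summarizing one record. It does not reproduce py_scripts/lmdb_utils.py, whose contents are not shown here. It assumes that each LMDB value is a pickled Python dict (the convention used by Uni-Mol) and that "coordinates" holds a list of conformers; both are assumptions, and `decode_record` and `summarize` are hypothetical helper names. In real use the raw bytes would come from an LMDB read, for example with the third-party `lmdb` package.

```python
import pickle

# Keys listed in the README for each training record.
EXPECTED_KEYS = {
    "atoms", "coordinates", "pocket_atoms",
    "pocket_coordinates", "mol", "smi", "pocket",
}


def decode_record(raw: bytes) -> dict:
    """Unpickle one LMDB value and check that it has the documented keys.

    Assumption: values are pickled dicts. Only unpickle files you trust.
    """
    record = pickle.loads(raw)
    missing = EXPECTED_KEYS - record.keys()
    if missing:
        raise KeyError(f"record is missing keys: {sorted(missing)}")
    return record


def summarize(record: dict) -> dict:
    """Return a few sizes that are handy for sanity-checking a record."""
    return {
        "pocket": record["pocket"],
        "smi": record["smi"],
        "n_ligand_atoms": len(record["atoms"]),
        # Assumption: one entry per conformer (the README says at most 10).
        "n_conformers": len(record["coordinates"]),
        "n_pocket_atoms": len(record["pocket_atoms"]),
    }


# Toy record with made-up values, standing in for bytes read from train.lmdb.
toy = {
    "atoms": ["C", "O"],
    "coordinates": [[[0.0, 0.0, 0.0], [1.4, 0.0, 0.0]]],
    "pocket_atoms": ["N", "C", "C"],
    "pocket_coordinates": [[5.0, 0.0, 0.0], [6.0, 0.0, 0.0], [7.0, 0.0, 0.0]],
    "mol": None,  # placeholder for the RDKit molecule object
    "smi": "CO",
    "pocket": "1abc",
}
summary = summarize(decode_record(pickle.dumps(toy)))
print(summary)
```

Checking the keys up front means a file with a different schema fails with a clear message instead of a confusing error later in training.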
+The dataset is compiled from the PDBbind dataset and combines authentic protein-ligand complexes with complexes generated through HomoAug, a technique for augmenting data with homology-based transformations.
+### Test data
+#### DUD-E
+```
+DUD-E
+├── gene id
+│   ├── receptor.pdb
+│   ├── crystal_ligand.mol2
+│   ├── actives_final.ism
+│   ├── decoys_final.ism
+│   ├── mols.lmdb (containing all actives and decoys)
+│   ├── pocket.lmdb
+```
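The directory layout above lends itself to a quick completeness check before running the test script. The following is a minimal sketch under the assumption that every target directory should contain all six entries shown in the tree; `find_incomplete_targets` is a hypothetical helper, not part of the repository, and the demonstration uses a temporary directory with made-up target names.

```python
import tempfile
from pathlib import Path

# Entries the README lists under each DUD-E target directory.
DUDE_FILES = (
    "receptor.pdb",
    "crystal_ligand.mol2",
    "actives_final.ism",
    "decoys_final.ism",
    "mols.lmdb",
    "pocket.lmdb",
)


def find_incomplete_targets(root):
    """Map each target directory name to the expected entries it lacks."""
    incomplete = {}
    for target_dir in sorted(p for p in Path(root).iterdir() if p.is_dir()):
        # exists() accepts both files and directories, so it works
        # whichever form the .lmdb entries take on disk.
        missing = [n for n in DUDE_FILES if not (target_dir / n).exists()]
        if missing:
            incomplete[target_dir.name] = missing
    return incomplete


# Toy demonstration: one complete target and one with only the receptor.
with tempfile.TemporaryDirectory() as tmp:
    complete = Path(tmp) / "target_a"
    partial = Path(tmp) / "target_b"
    complete.mkdir()
    partial.mkdir()
    for name in DUDE_FILES:
        (complete / name).touch()
    (partial / "receptor.pdb").touch()
    report = find_incomplete_targets(tmp)

print(report)
```

An empty result means every target directory has the full set of files; any other result names the targets to re-download or re-process.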
+#### LIT-PCBA
+```
+lit_pcba
+├── target name
+│   ├── PDBID_protein.mol2
+│   ├── PDBID_ligand.mol2
+│   ├── actives.smi
+│   ├── inactives.smi
+│   ├── mols.lmdb (containing all actives and inactives)
+│   ├── pocket.lmdb
+```
+### Data preprocessing
+See py_scripts/write_dude_multi.py.
+## HomoAug
+Please refer to the HomoAug directory for details.
+## Train
+bash drugclip.sh
+## Test
+bash test.sh
+## Retrieval
+bash retrieval.sh
+In the Google Drive folder, example pocket.lmdb and mols.lmdb files are available under the retrieval directory.
+## Citation
+If you find our work useful, please cite our paper:
+```bibtex
+@inproceedings{gao2023drugclip,
+    author = {Gao, Bowen and Qiang, Bo and Tan, Haichuan and Jia, Yinjun and Ren, Minsi and Lu, Minsi and Liu, Jingjing and Ma, Wei-Ying and Lan, Yanyan},
+    title = {DrugCLIP: Contrastive Protein-Molecule Representation Learning for Virtual Screening},
+    booktitle = {NeurIPS 2023},
+    year = {2023},
+    url = {https://openreview.net/forum?id=lAbCgNcxm7},
+}
+```