# Code for HomoAug

HomoAug is a data augmentation method for generating pseudo pocket-ligand pairs. It is based on the idea that the ligand binding sites of homologous proteins are similar. The details of HomoAug can be found in our paper: "DrugCLIP: Contrastive Protein-Molecule Representation Learning for Virtual Screening".

## Requirements

To run the code, you need to install the following packages:

- ray (version 1.12.0)
- jackhmmer (part of HMMER)
- esl-reformat (part of HMMER)
- TM-align

Install ray with `pip install ray==1.12.0`.

The executable files of jackhmmer, esl-reformat and TM-align are included in the `bin` folder. Please add the path of `bin` to your `PATH` environment variable.
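
Before starting a long run, it can help to confirm that the external tools are actually reachable. The following is a minimal sketch, not part of HomoAug itself. The executable names in `REQUIRED_TOOLS` are assumptions (in particular `TMalign`), so adjust them to match the file names in your `bin` folder.

```python
import shutil

# Assumed executable names; adjust them to match the files in your `bin` folder.
REQUIRED_TOOLS = ["jackhmmer", "esl-reformat", "TMalign"]


def find_missing_tools(tools=REQUIRED_TOOLS):
    """Return the tools that cannot be found on the current PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = find_missing_tools()
    if missing:
        print("Not found on PATH:", ", ".join(missing))
    else:
        print("All required tools were found on PATH.")
```

If a tool is reported as missing, check that the `bin` path was exported in the same shell session that launches HomoAug.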

## Usage

<!---     parser = argparse.ArgumentParser()
    parser.add_argument("--id_file", type=str, default="/drug/BioLip/tmp.id")
    parser.add_argument("--homoaug_dir", type=str, default="/drug/BioLip/homoaug_new")
    parser.add_argument(
        "--fasta_file",
        type=str,
        default="/drug/BioLip/BioLiP_v2023-04-13_regularLigand.fasta",
    )
    parser.add_argument(
        "--protein_pdb_dir", type=str, default="/drug/BioLip/protein_pdb"
    )
    parser.add_argument(
        "--pocket_pdbs_dir", type=str, default="/drug/BioLip/pocket_pdb"
    )
    parser.add_argument(
        "--jackhmmer_output_dir", type=str, default="/drug/BioLip/pdbbind_MSA_fasta"
    )
    parser.add_argument(
        "--n_thread", type=int, default=10, help="number of threads for running"
    )
    parser.add_argument(
        "--database_fasta_path",
        type=str,
        default="/data/protein/AF2DB/AFDB_HC_50.fa",
        help="jackhmmer search database, in fasta format",
    )
    parser.add_argument(
        "--database_pdb_dir",
        type=str,
        default="/drug/AFDB_HC_50_PDB",
        help="homoaug search database, e.g. AF2DB",
    )
    parser.add_argument(
        "--max_extend_num",
        type=int,
        default=20,
        help="max number of extended pocket-ligand pairs for one real pocket-ligand pair",
    )
    parser.add_argument(
        "--TMscore_threshold",
        type=float,
        default=0.4,
        help="TMscore threshold for extending",
    )
    parser.add_argument(
        "--Match_rate_threshold",
        type=float,
        default=0.4,
        help="Match_rate threshold for extending",
    )
-->

To use HomoAug, run the following command:

```bash
python run_HomoAug.py \
    --id_file your_id_file \
    --homoaug_dir your_homoaug_dir \
    --fasta_file your_fasta_file \
    --protein_pdb_dir your_protein_pdb_dir \
    --pocket_pdbs_dir your_pocket_pdbs_dir \
    --jackhmmer_output_dir your_jackhmmer_output_dir \
    --n_thread your_n_thread \
    --database_fasta_path your_database_fasta_path \
    --database_pdb_dir your_database_pdb_dir \
    --max_extend_num your_max_extend_num \
    --TMscore_threshold your_TMscore_threshold \
    --Match_rate_threshold your_Match_rate_threshold
```

Each parameter is explained below.
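
If you prefer to launch HomoAug from a Python script, the command can be assembled from a dictionary of options. This is only a sketch: the flag names come from the command above, while every path in `options` is a hypothetical placeholder and the numeric values are illustrative.

```python
import shlex


def build_homoaug_command(options, script="run_HomoAug.py"):
    """Turn a {flag_name: value} mapping into an argument list."""
    cmd = ["python", script]
    for name, value in options.items():
        cmd += [f"--{name}", str(value)]
    return cmd


# Hypothetical paths and illustrative values; replace them with your own.
options = {
    "id_file": "data/pairs.id",
    "homoaug_dir": "out/homoaug",
    "fasta_file": "data/pairs.fasta",
    "protein_pdb_dir": "data/protein_pdb",
    "pocket_pdbs_dir": "data/pocket_pdb",
    "jackhmmer_output_dir": "out/jackhmmer",
    "n_thread": 10,
    "database_fasta_path": "db/database.fa",
    "database_pdb_dir": "db/database_pdb",
    "max_extend_num": 20,
    "TMscore_threshold": 0.4,
    "Match_rate_threshold": 0.4,
}

command = build_homoaug_command(options)
print(shlex.join(command))
```

The resulting list can be passed to `subprocess.run(command, check=True)` to start the run.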


#### --id_file

The id_file is a file containing the IDs of the real pocket-ligand pairs, one ID per line.

For example, the id_file for the BioLiP dataset can look like this:

```
2WNS_B_receptor_B_550_OMP
5RVK_A_receptor_A_201_2AK
5F03_A_receptor_A_301_5TA
7EXF_B_receptor_B_801_GAL
4OJ4_A_receptor_A_501_DIF
1PJ7_A_receptor_A_2887_FFO
......
```

Create your own id_file according to your dataset, using the format `<pdb_id>_<chain_id>_<any_string>_<ligand_name_in_pdb>`.
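
To check that your IDs follow this format before running HomoAug, a small parser can help. The sketch below is an illustration, not the parsing code used by HomoAug. It assumes that the first two underscore-separated fields are the PDB ID and chain ID, that the last field is the ligand name, and that everything in between is the free-form string. The names `PairId` and `parse_pair_id` are hypothetical.

```python
from typing import NamedTuple


class PairId(NamedTuple):
    pdb_id: str
    chain_id: str
    middle: str  # the free-form <any_string> part, may itself contain "_"
    ligand_name: str


def parse_pair_id(pair_id: str) -> PairId:
    """Split an ID of the form <pdb_id>_<chain_id>_<any_string>_<ligand_name>."""
    parts = pair_id.strip().split("_")
    if len(parts) < 4:
        raise ValueError(f"ID has fewer than four fields: {pair_id!r}")
    return PairId(parts[0], parts[1], "_".join(parts[2:-1]), parts[-1])


if __name__ == "__main__":
    print(parse_pair_id("2WNS_B_receptor_B_550_OMP"))
```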


#### --homoaug_dir

The output directory of HomoAug. When the program finishes, the extended pocket-ligand pairs can be found in this directory.

The structure of homoaug_dir is as follows:

```
homoaug_dir
├── PDBID
│   ├── PDBID.fasta
│   ├── PDBID.pdb
│   ├── PDBID_pocket.pdb
│   ├── PDBID_pocket_chain.pdb
│   ├── PDBID_ligand.pdb
│   ├── PDBID_pocket_position.txt
│   ├── rotation_matrix
│   │   ├── (several TMalign output files)
│   ├── extend
│       ├── AugmentedPDBID1
│       │   ├── AugmentedPDBID1_protein.pdb
│       │   ├── AugmentedPDBID1_pocket.pdb
│       ├── AugmentedPDBID2
│       │   ├── AugmentedPDBID2_protein.pdb
│       │   ├── AugmentedPDBID2_pocket.pdb
│       ......
│
......
```

Each AugmentedPDBID directory under `extend` corresponds to one extended pocket-ligand pair.
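
Once a run has finished, the layout above can be walked to list the augmented pairs that were written completely. The following sketch assumes that the subdirectory is literally named `extend` and that a complete pair consists of both a `_protein.pdb` and a `_pocket.pdb` file. The function name `list_augmented_pairs` is hypothetical.

```python
from pathlib import Path


def list_augmented_pairs(homoaug_dir, extend_name="extend"):
    """Map each PDBID to the augmented IDs that have both expected files."""
    results = {}
    for pdb_dir in sorted(Path(homoaug_dir).iterdir()):
        extend_dir = pdb_dir / extend_name
        if not extend_dir.is_dir():
            continue  # no augmentation output for this entry
        complete = []
        for aug_dir in sorted(extend_dir.iterdir()):
            if not aug_dir.is_dir():
                continue
            protein = aug_dir / f"{aug_dir.name}_protein.pdb"
            pocket = aug_dir / f"{aug_dir.name}_pocket.pdb"
            if protein.is_file() and pocket.is_file():
                complete.append(aug_dir.name)
        if complete:
            results[pdb_dir.name] = complete
    return results
```

For example, `sum(len(v) for v in list_augmented_pairs("your_homoaug_dir").values())` gives the total number of complete augmented pairs.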


#### --fasta_file

The FASTA file corresponding to the id_file. Each header line should begin with the PDB ID.

e.g.:

```
>1q20_A_O00204_A_PLO
SDISEISQKLPGEYFRYK......
>1q21_A_P01112_A_GDP
MTEYKLVVVGAGGVGKSA......
>1q22_A_O00204_A_A3P
SDISEISQKLPGEYFRVP......
......
```
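
A quick consistency check between the id_file and the FASTA file can catch missing sequences early. The sketch below reads FASTA records with plain Python and reports PDB IDs from the id_file that have no matching header. Comparing PDB IDs case-insensitively is an assumption made for this example, not documented HomoAug behavior, and both function names are hypothetical.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:].strip(), []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def pdb_ids_missing_from_fasta(pair_ids, fasta_lines):
    """Return PDB IDs (lowercased) in pair_ids that no FASTA header starts with."""
    fasta_ids = {h.split("_")[0].lower() for h, _ in read_fasta(fasta_lines)}
    wanted = {pid.strip().split("_")[0].lower() for pid in pair_ids if pid.strip()}
    return sorted(wanted - fasta_ids)
```

Both arguments can be open file handles, for example `pdb_ids_missing_from_fasta(open("your_id_file"), open("your_fasta_file"))`.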


#### --protein_pdb_dir

The directory containing the structure files of the input proteins, in CIF format.

Name format: `<pdb_id>.cif`

#### --pocket_pdbs_dir

The directory containing the structure files of the input pockets, in PDB format.

Name format: `<pdb_id>.pdb`


#### --jackhmmer_output_dir

The temporary directory for storing the output of jackhmmer.

#### --n_thread

The number of threads to run with.

#### --database_fasta_path

The FASTA file of the database that jackhmmer searches for augmentation, e.g. AF2DB.

#### --database_pdb_dir

The directory containing the PDB files of the database used for augmentation, e.g. AF2DB's PDB directory.

#### --max_extend_num

The maximum number of extended pocket-ligand pairs for one real pocket-ligand pair.

#### --TMscore_threshold and --Match_rate_threshold

Only the extended pocket-ligand pairs with TMscore >= TMscore_threshold and Match_rate >= Match_rate_threshold are kept.

Match_rate is the ratio of the number of matched residues to the number of residues in the real pocket.
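
The filtering rule can be written out as a short sketch. This only illustrates the two conditions stated above and is not the code used by HomoAug. The function names are hypothetical, and the default thresholds of 0.4 are taken from the commented-out argument parser earlier in this README.

```python
def match_rate(n_matched, n_pocket_residues):
    """Ratio of matched residues to residues in the real pocket."""
    if n_pocket_residues <= 0:
        raise ValueError("the real pocket must contain at least one residue")
    return n_matched / n_pocket_residues


def keep_candidate(tm_score, n_matched, n_pocket_residues,
                   tm_threshold=0.4, match_rate_threshold=0.4):
    """Keep a candidate only if both thresholds are met (inclusive)."""
    return (tm_score >= tm_threshold
            and match_rate(n_matched, n_pocket_residues) >= match_rate_threshold)


if __name__ == "__main__":
    # 12 of 20 pocket residues matched gives a Match_rate of 0.6.
    print(keep_candidate(tm_score=0.55, n_matched=12, n_pocket_residues=20))
```

A candidate with a high TMscore is still discarded when too few pocket residues are matched, and vice versa, because both conditions must hold.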


## Note

When an error occurs for a pocket-ligand pair, HomoAug skips that pair. Such cases include, but are not limited to, pockets composed of multiple chains. Please note that even if the program finishes successfully, some files may be missing from certain output directories, which indicates that the corresponding pocket-ligand pair is currently not suitable for HomoAug. In summary, all valid HomoAug results are saved in the `extend` folder.